A 7-nm Four-Core Mixed-Precision AI Chip with 26.2-TFLOPS Hybrid-FP8 Training, 104.9-TOPS INT4 Inference, and Workload-Aware Throttling

Sae Kyu Lee; Ankur Agrawal; Joel Silberman; Matthew Ziegler; Mingu Kang; Swagath Venkataramani; Nianzheng Cao; Bruce Fleischer; Michael Guillorn; Matthew Cohen; Silvia M. Mueller; Jinwook Oh; Martin Lutz; Jinwook Jung; Siyu Koswatta; Ching Zhou; Vidhi Zalani; Monodeep Kar; James Bonanno; Robert Casatuta; Chia-Yu Chen; Jungwook Choi; Howard Haynie; Alyssa Herbert; Radhika Jain; Kyu-Hyoun Kim; Yulong Li; Zhibin Ren; Scot Rider; Marcel Schaal; Kerstin Schelm; Michael R. Scheuermann; Xiao Sun; Hung Tran; Naigang Wang; Wei Wang; Xin Zhang; Vinay Shah; Brian Curran; Vijayalakshmi Srinivasan; Pong-Fei Lu; Sunil Shukla; Kailash Gopalakrishnan; Leland Chang

doi:10.1109/JSSC.2021.3120113

IEEE JSSC

Paper

31 Dec 2021

Abstract

Reduced-precision computation is a key enabling factor for energy-efficient acceleration of deep learning (DL) applications. This article presents a 7-nm four-core mixed-precision artificial intelligence (AI) chip that supports four compute precisions, FP16, Hybrid-FP8 (HFP8), INT4, and INT2, to meet diverse application demands for training and inference. The chip leverages recent algorithmic advances to demonstrate leading-edge power efficiency for 8-bit floating-point (FP8) training and INT4 inference without model accuracy degradation. A new HFP8 format, combined with separation of the floating- and fixed-point pipelines and aggressive circuit/architecture optimization, enables performance improvements while maintaining high compute utilization. A high-bandwidth ring protocol enables efficient data communication, while power management using workload-aware clock throttling maximizes performance within a given power budget. The AI chip demonstrates 3.58-TFLOPS/W peak energy efficiency and 26.2-TFLOPS peak performance for HFP8 iso-accuracy training, and 16.9-TOPS/W peak energy efficiency and 104.9-TOPS peak performance for INT4 iso-accuracy inference.
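To put the headline figures side by side, the short Python sketch below computes how the INT4 inference peaks compare with the HFP8 training peaks. Only the four numbers come from the abstract; the variable names are illustrative. The abstract does not say whether peak throughput and peak efficiency were measured at the same operating point, so the sketch compares like with like and does not attempt to derive a power figure.

```python
# Peak figures quoted in the abstract.
HFP8_PEAK_TFLOPS = 26.2          # HFP8 training throughput
HFP8_PEAK_TFLOPS_PER_W = 3.58    # HFP8 training energy efficiency
INT4_PEAK_TOPS = 104.9           # INT4 inference throughput
INT4_PEAK_TOPS_PER_W = 16.9      # INT4 inference energy efficiency

# Ratio of INT4 to HFP8 peak throughput (operations per second).
throughput_ratio = INT4_PEAK_TOPS / HFP8_PEAK_TFLOPS

# Ratio of INT4 to HFP8 peak energy efficiency (operations per joule).
efficiency_ratio = INT4_PEAK_TOPS_PER_W / HFP8_PEAK_TFLOPS_PER_W

print(f"INT4 vs HFP8 peak throughput: {throughput_ratio:.2f}x")
print(f"INT4 vs HFP8 peak efficiency: {efficiency_ratio:.2f}x")
```

On these numbers, INT4 delivers about 4.0 times the peak throughput of HFP8 and about 4.7 times the peak energy efficiency.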
