A Microscaling Multi-Mode Gain-Cell Computing-in-Memory Macro for Advanced AI Edge Device

Detailed bibliography
Published in: IEEE Journal of Solid-State Circuits, pp. 1-14
Main authors: Tien, Jen-Chun; Wu, Ping-Chun; Khwa, Win-San; Sanjay Lele, Ashwin; Su, Jian-Wei; Cheng, Chiao-Yen; Hsu, Jun-Ming; Chen, Yu-Chen; Hsieh, Le-Jung; Bai, Jyun-Cheng; Kao, Yu-Sheng; Lou, Tsung-Han; Wu, Jui-Jen; Lo, Chung-Chuan; Liu, Ren-Shuo; Hsieh, Chih-Cheng; Tang, Kea-Tiong; Chang, Meng-Fan
Format: Journal Article
Language: English
Published: IEEE 2025
ISSN:0018-9200, 1558-173X
Description
Summary: The microscaling (MX) format is an emerging data representation that quantizes high-bitwidth floating-point (FP) values into low-bitwidth FP-like values with a shared-scale (SS) exponent. When implemented with computing-in-memory (CIM), MX allows an attractive tradeoff between accuracy and hardware efficiency for specific neural network (NN) workloads. This work presents the first multi-mode gain-cell (GC) CIM macro capable of processing MX, integer (INT), and FP multiply-and-accumulate (MAC) operations with high energy efficiency (EEF) and area efficiency (AEF). The proposed macro employs four important innovations: 1) a multi-mode input processing unit (M2-IPU) with SS-variance-aware MAC flow (SS-VAF) for SS processing and SS alignment within the CIM macro to reduce system-to-CIM data transfer and compute energy; 2) a pattern-aware hybrid adder tree (PAH-ADT), which improves EEF and AEF by optimizing the common input patterns; 3) an accumulation-aware data flow (A2-DF) that adjusts the write path based on accumulation size to reduce data transfer energy; and 4) a 3.xT GC, which boosts data retention time (DRT) by increasing parasitic capacitance without additional area overhead. A 16-nm FinFET 216-kb MX-INT-FP multi-mode GC-CIM macro achieved 133.5 TFLOPS/W for MX-MAC with MXINT8 input, MXINT8 weight, and FP32 output; and 91.9 TFLOPS/W for FP-MAC with BF16 input, BF16 weight, and FP32 output.
DOI:10.1109/JSSC.2025.3617944
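
Illustrative note: the summary describes MX as quantizing FP values into low-bitwidth elements that share one scale exponent, and an MX-MAC as operating on such blocks. The sketch below is a minimal software model of that idea, not the macro's hardware flow. It assumes a power-of-two shared scale and signed 8-bit elements read as fixed point with 6 fractional bits; the names `mx_quantize`, `mx_dequantize`, `mx_dot`, and `FRAC_BITS` are hypothetical and do not come from the record.

```python
import math

# Assumption: each 8-bit element is a fixed-point value with 6 fractional bits.
FRAC_BITS = 6


def mx_quantize(block):
    """Quantize one block of floats to (shared_exp, list of int8-range ints)."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 0, [0] * len(block)
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1, so floor(log2) = e - 1.
    shared_exp = math.frexp(max_abs)[1] - 1
    scale = 2.0 ** (shared_exp - FRAC_BITS)
    # Round to nearest and clamp to a symmetric signed 8-bit range.
    elems = [max(-127, min(127, round(v / scale))) for v in block]
    return shared_exp, elems


def mx_dequantize(shared_exp, elems):
    """Recover approximate floats from a shared exponent and integer elements."""
    scale = 2.0 ** (shared_exp - FRAC_BITS)
    return [e * scale for e in elems]


def mx_dot(a, b):
    """Dot product of two blocks: integer MAC, then one combined scale."""
    ea, qa = mx_quantize(a)
    eb, qb = mx_quantize(b)
    acc = sum(x * y for x, y in zip(qa, qb))  # integer multiply-accumulate
    return acc * 2.0 ** (ea + eb - 2 * FRAC_BITS)
```

In this model the multiply-accumulate runs entirely on small integers, and the two shared exponents are applied once per block pair, which is the kind of accuracy-versus-hardware-cost tradeoff the summary refers to.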