Memory-efficient tensor parallelism for long-sequence Transformer training

Detailed bibliographic record
Published in: Frontiers of Information Technology & Electronic Engineering, Vol. 26, No. 5, pp. 770-787
Main authors: Liang, Peng; Qiao, Linbo; Shi, Yanqi; Zheng, Hao; Tang, Yu; Li, Dongsheng
Format: Journal Article
Language: English
Published: Hangzhou: Zhejiang University Press, 01.05.2025; Springer Nature B.V.
ISSN: 2095-9184, 2095-9230
Description
Summary: Transformer-based models like large language models (LLMs) have attracted significant attention in recent years due to their superior performance. A long sequence of input tokens is essential for industrial LLMs to provide better user services. However, memory consumption increases quadratically with sequence length, posing challenges for scaling up long-sequence training. Current parallelism methods produce duplicated tensors during execution, leaving room for improving memory efficiency. Additionally, tensor parallelism (TP) cannot achieve effective overlap between computation and communication. To address these weaknesses, we propose a general parallelism method called memory-efficient tensor parallelism (METP), designed for the computation of two consecutive matrix multiplications and a possible function between them (O = f(AB)C), which is the kernel computation component in Transformer training. METP distributes subtasks of computing O to multiple devices and uses send/recv instead of collective communication to exchange submatrices for finishing the computation, avoiding the production of duplicated tensors. We also apply the double-buffering technique to achieve better overlap between computation and communication. We present the theoretical condition of full overlap to help guide the long-sequence training of Transformers. Suppose the parallel degree is p; through theoretical analysis, we prove that METP provides O(1/p^3) memory overhead when not using FlashAttention to compute attention, and could save at least 41.7% memory compared to TP when using FlashAttention to compute multi-head self-attention. Our experimental results demonstrate that METP can increase the sequence length by 2.38–2.99 times compared to other methods when using eight A100 graphics processing units (GPUs).
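To make the kernel computation named in the abstract concrete, below is a minimal single-process sketch of computing O = f(AB)C block by block across p notional workers, with softmax standing in for f and plain array indexing standing in for the send/recv exchange of submatrices. The function name metp_like_kernel, the block layout, and the use of NumPy are illustrative assumptions, not the paper's implementation.

```python
# Single-process sketch of O = f(A @ B) @ C split over p notional workers.
# The block layout and the choice of softmax for f are assumptions for
# illustration; a real METP run would exchange the B and C blocks between
# devices with point-to-point send/recv rather than indexing a shared list.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def metp_like_kernel(A, B, C, p, f=softmax):
    """Compute O = f(A @ B) @ C with A row-partitioned over p workers."""
    assert A.shape[0] % p == 0 and B.shape[1] % p == 0 and C.shape[0] == B.shape[1]
    A_blocks = np.split(A, p, axis=0)   # each worker owns one row block of A
    B_blocks = np.split(B, p, axis=1)   # column blocks of B, one per worker
    C_blocks = np.split(C, p, axis=0)   # matching row blocks of C
    out_rows = []
    for rank in range(p):               # this loop stands in for the p devices
        # "Receive" the B blocks one at a time and build this worker's
        # row block of S = A @ B.
        S_row = np.concatenate(
            [A_blocks[rank] @ B_blocks[j] for j in range(p)], axis=1)
        P_row = f(S_row)                # apply f (here softmax) to the row block
        # "Receive" the C blocks one at a time and accumulate the output row
        # block, so no worker materializes a duplicated full copy of f(A @ B).
        P_cols = np.split(P_row, p, axis=1)
        out_rows.append(sum(P_cols[j] @ C_blocks[j] for j in range(p)))
    return np.concatenate(out_rows, axis=0)
```

A quick check against the unpartitioned computation, again purely illustrative:

```python
rng = np.random.default_rng(0)
A, B, C = rng.normal(size=(8, 4)), rng.normal(size=(4, 8)), rng.normal(size=(8, 4))
assert np.allclose(metp_like_kernel(A, B, C, p=4), softmax(A @ B) @ C)
```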
DOI: 10.1631/FITEE.2400602