Spatial-temporal hierarchical decoupled masked autoencoder: A self-supervised learning framework for electrocardiogram

Bibliographic details
Published in: Expert systems with applications, Volume 298; p. 129603
Main authors: Wei, Xiaoyang; Li, Zhiyuan; Tian, Yuanyuan; Wang, Mengxiao; Jin, Yanrui; Ding, Weiping; Liu, Chengliang
Format: Journal Article
Language: English
Published: Elsevier Ltd 01.03.2026
ISSN: 0957-4174
Description
Summary: The difficulty of labeling Electrocardiogram (ECG) has prompted researchers to use self-supervised learning to enhance diagnostic performance. Masked autoencoders (MAE) are a mainstream paradigm where models learn a latent representation of the signal by reconstructing masked portions of the ECG. However, existing methods lack a specific design for the spatial–temporal characteristics of ECG. Specifically, leads represent spatial projections of cardiac activity, while timestamps capture temporal patterns, and the two correspond to different axes of information. Existing MAE frameworks tend to unify them prematurely, potentially weakening critical local dependencies. In this paper, we propose a Spatial-Temporal Hierarchical Decoupled Masked Autoencoder (STHD-MAE). This framework decouples ECG into isolated leads or time steps in the shallow layer to capture local dependencies with different views, then aligns spatial–temporal representations and re-establishes global dependencies in the deep layer to comprehensively represent pathological information. We also design a medical report fusion module during pre-training, which uses cross-attention to align the ECG report text encoded by a medical language model with the signal’s latent representation, thereby guiding the encoder to focus on pathological information through implicit cross-modal learning. We validate the effectiveness of STHD-MAE on multiple downstream classification and reconstruction tasks. The results show that STHD-MAE outperforms existing self-supervised learning methods by approximately 2% in F1-scores for both coarse-grained and fine-grained classification performance, and its reconstruction quality also exceeds the baseline generative model.
DOI: 10.1016/j.eswa.2025.129603
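
Illustrative note: the summary describes splitting an ECG into isolated leads or isolated time steps before masking. The sketch below is not the authors' implementation. It only shows, on a toy signal, what a per-lead view and a per-time-step view of the same recording could look like, and how a random mask over tokens might be drawn. The function names, the patch length, and the 50% mask ratio are all assumptions made for illustration.

```python
import random


def lead_view(ecg):
    """Per-lead view: each lead is kept as its own isolated sequence."""
    return [list(lead) for lead in ecg]


def time_view(ecg):
    """Per-time-step view: one token per instant, holding every lead's value."""
    return [list(step) for step in zip(*ecg)]


def patchify(seq, patch_len):
    """Cut one sequence into non-overlapping patches (patch_len is an assumption)."""
    return [seq[i:i + patch_len]
            for i in range(0, len(seq) - patch_len + 1, patch_len)]


def random_mask(n_tokens, mask_ratio, rng):
    """Return a list of booleans; True marks a token hidden from the encoder."""
    n_masked = int(n_tokens * mask_ratio)
    masked = set(rng.sample(range(n_tokens), n_masked))
    return [i in masked for i in range(n_tokens)]


# Toy recording: 3 leads x 8 samples (hypothetical values, not real ECG data).
ecg = [[lead * 10 + t for t in range(8)] for lead in range(3)]

# View 1: patches drawn from a single lead at a time.
lead_tokens = [patchify(lead, patch_len=4) for lead in lead_view(ecg)]

# View 2: one token per time step, spanning all leads.
time_tokens = time_view(ecg)

# Hide half of the time-step tokens (the ratio is an assumed value).
rng = random.Random(0)
mask = random_mask(len(time_tokens), mask_ratio=0.5, rng=rng)
visible_time_tokens = [tok for tok, hidden in zip(time_tokens, mask) if not hidden]
```

In a masked autoencoder, only the visible tokens would be passed to the encoder and the hidden ones would be reconstruction targets. How the two views are aligned in deeper layers is specific to the paper and is not sketched here.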