Spatial-temporal hierarchical decoupled masked autoencoder: A self-supervised learning framework for electrocardiogram

Bibliographic Details
Published in: Expert Systems with Applications, Vol. 298, p. 129603
Main Authors: Wei, Xiaoyang, Li, Zhiyuan, Tian, Yuanyuan, Wang, Mengxiao, Jin, Yanrui, Ding, Weiping, Liu, Chengliang
Format: Journal Article
Language: English
Published: Elsevier Ltd, 01.03.2026
ISSN: 0957-4174
Description
Summary: The difficulty of labeling electrocardiogram (ECG) data has prompted researchers to use self-supervised learning to enhance diagnostic performance. Masked autoencoders (MAE) are a mainstream paradigm in which models learn a latent representation of the signal by reconstructing masked portions of the ECG. However, existing methods lack a design specific to the spatial-temporal characteristics of ECG: leads represent spatial projections of cardiac activity, while timestamps capture temporal patterns, so the two correspond to different axes of information. Existing MAE frameworks tend to unify them prematurely, potentially weakening critical local dependencies. In this paper, we propose a Spatial-Temporal Hierarchical Decoupled Masked Autoencoder (STHD-MAE). The framework decouples the ECG into isolated leads or time steps in the shallow layers to capture local dependencies from different views, then aligns the spatial and temporal representations and re-establishes global dependencies in the deep layers to comprehensively represent pathological information. We also design a medical report fusion module for pre-training, which uses cross-attention to align ECG report text encoded by a medical language model with the signal's latent representation, thereby guiding the encoder to focus on pathological information through implicit cross-modal learning. We validate the effectiveness of STHD-MAE on multiple downstream classification and reconstruction tasks. STHD-MAE outperforms existing self-supervised learning methods by approximately 2% in F1-score on both coarse-grained and fine-grained classification, and its reconstruction quality exceeds that of the baseline generative model.
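
For intuition, the sketch below illustrates the kind of architecture the summary describes: shallow blocks restrict self-attention to a single lead (temporal view) or a single time step (spatial view), a deep block re-establishes global dependencies over all lead-time tokens, and a final cross-attention call mirrors the report fusion step. This is an assumption-laden PyTorch illustration, not the authors' code; DecoupledBlock, GlobalBlock, the 12-lead-by-25-patch token grid, and the placeholder report embeddings are all hypothetical.

import torch
import torch.nn as nn

class DecoupledBlock(nn.Module):
    # Self-attention restricted to one axis of a (lead, time) token grid.
    def __init__(self, dim, heads, axis):
        super().__init__()
        assert axis in ("time", "lead")
        self.axis = axis
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (batch, leads, time, dim)
        b, l, t, d = x.shape
        if self.axis == "time":
            x = x.reshape(b * l, t, d)  # attend over time within each lead
        else:
            x = x.transpose(1, 2).reshape(b * t, l, d)  # attend across leads per time step
        h = self.norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        if self.axis == "time":
            return x.reshape(b, l, t, d)
        return x.reshape(b, t, l, d).transpose(1, 2)

class GlobalBlock(nn.Module):
    # Full attention over all lead-time tokens, as in the deep layers.
    def __init__(self, dim, heads):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        b, l, t, d = x.shape
        x = x.reshape(b, l * t, d)
        h = self.norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x.reshape(b, l, t, d)

# Toy forward pass: 12-lead ECG split into 25 temporal patches, 64-dim tokens.
x = torch.randn(2, 12, 25, 64)
for blk in (DecoupledBlock(64, 4, "time"), DecoupledBlock(64, 4, "lead"), GlobalBlock(64, 4)):
    x = blk(x)

# Report fusion sketch: ECG tokens query hypothetical report embeddings.
fuse = nn.MultiheadAttention(64, 4, batch_first=True)
text = torch.randn(2, 32, 64)  # placeholder for a medical language model's output
q = x.reshape(2, -1, 64)
fused = q + fuse(q, text, text, need_weights=False)[0]
print(fused.shape)  # torch.Size([2, 300, 64])

Restricting attention to one axis in the shallow blocks keeps each view's local dependencies intact before the global block mixes them, which corresponds to the hierarchy the summary describes.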
DOI: 10.1016/j.eswa.2025.129603