Multi-task Piano Transcription with Local Relative Time Attention

Bibliographic details
Published in: Proceedings ... Asia-Pacific Signal and Information Processing Association Annual Summit and Conference APSIPA ASC ... (Online), pp. 966-971
Main authors: Wang, Qi; Liu, Mingkuan; Chen, Xianhong; Xiong, Mengwen
Format: Conference proceeding
Language: English
Published: IEEE, 31 October 2023
ISSN: 2640-0103
Description
Summary: Automatic music transcription (AMT) converts music audio into symbolic musical representations. Transformer-based transcription systems have recently proved superior at modeling note-wise sequences. For the frame-wise transcription targets in AMT, however, the attention needs to focus more on neighboring frames than on notes in context. In this work, we propose a multi-task transcription system with a self-attention mechanism. The relative positional self-attention is designed to model frame-wise short-term dependencies in audio and to transcribe music of variable length. By adding a learnable attention mask to multiple attention heads, the network can obtain different multi-scale attention distances for each subtask. Experiments on the MAESTRO dataset show that the proposed system with the local relative time attention mechanism achieves state-of-the-art transcription performance on both frame and note metrics (frame F1 93.40%, note-with-offset F1 88.50%).
DOI: 10.1109/APSIPAASC58517.2023.10317104
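
Illustrative sketch (not part of the catalog record and not the authors' implementation): the summary describes self-attention over audio frames with relative positions and a per-head mask that limits attention distance. The NumPy code below shows one way such a mechanism could look. The function name `local_relative_attention`, the clipped relative-position bias table `rel_bias`, and the Gaussian-shaped soft mask controlled by `widths` are all assumptions chosen for illustration; the paper's actual mask parameterization is not given in this record. In a real system the bias table and widths would be trained parameters; here they are plain arrays.

```python
import numpy as np

def local_relative_attention(x, w_q, w_k, w_v, rel_bias, widths):
    """Multi-head self-attention with a relative-position bias and a
    per-head soft locality mask (hypothetical parameterization).

    x:        (T, D) frame features; T may vary between inputs
    w_q/k/v:  (H, D, d) per-head projection matrices
    rel_bias: (H, 2*R + 1) bias per clipped relative offset in [-R, R]
    widths:   (H,) positive window width per head, in frames
    Returns (output of shape (T, H*d), weights of shape (H, T, T)).
    """
    T = x.shape[0]
    H, _, d = w_q.shape
    R = (rel_bias.shape[1] - 1) // 2

    # offsets[i, j] = j - i, the signed distance between frames i and j.
    offsets = np.arange(T)[None, :] - np.arange(T)[:, None]
    # Clip to [-R, R] and shift to [0, 2R] so any length T reuses one table.
    bias_index = np.clip(offsets, -R, R) + R

    outputs, all_weights = [], []
    for h in range(H):
        q, k, v = x @ w_q[h], x @ w_k[h], x @ w_v[h]
        scores = q @ k.T / np.sqrt(d)
        scores = scores + rel_bias[h][bias_index]
        # Soft mask: subtracting a quadratic penalty in log-space multiplies
        # the attention weights by a Gaussian window of the given width.
        scores = scores - offsets.astype(float) ** 2 / (2.0 * widths[h] ** 2)
        # Numerically stable softmax over the key axis.
        scores = scores - scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights = weights / weights.sum(axis=-1, keepdims=True)
        outputs.append(weights @ v)
        all_weights.append(weights)
    return np.concatenate(outputs, axis=-1), np.stack(all_weights)
```

A head with a small width concentrates on adjacent frames, while a head with a large width behaves almost like unrestricted attention, which is one way different heads could cover different attention distances.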