Multi-task Piano Transcription with Local Relative Time Attention

Automatic music transcription (AMT) is to transcribe music audios into musical symbol representations. Recently, the Transformer-based transcription systems have shown superiority on modeling note-wise sequences. For the frame-wise transcription targets in the AMT, the attention needs to focus more...

Celý popis

Uloženo v:
Podrobná bibliografie
Vydáno v:Proceedings ... Asia-Pacific Signal and Information Processing Association Annual Summit and Conference APSIPA ASC ... (Online) s. 966 - 971
Hlavní autoři: Wang, Qi, Liu, Mingkuan, Chen, Xianhong, Xiong, Mengwen
Médium: Konferenční příspěvek
Jazyk:angličtina
Vydáno: IEEE 31.10.2023
Témata:
ISSN:2640-0103
On-line přístup:Získat plný text
Tagy: Přidat tag
Žádné tagy, Buďte první, kdo vytvoří štítek k tomuto záznamu!
Popis
Shrnutí:Automatic music transcription (AMT) is to transcribe music audios into musical symbol representations. Recently, the Transformer-based transcription systems have shown superiority on modeling note-wise sequences. For the frame-wise transcription targets in the AMT, the attention needs to focus more on the neighboring frames instead of notes in context. In this work, we propose a multi-task transcription system with a self-attention mechanism. The designed relative positional self-attention aims to model frame-wise short-term dependencies in audio and transcribe music of variable length. Adding the learnable attention mask on multiple attention head, the network can obtain different multi-scale attention distances for each subtask. Experiments on the MAESTRO dataset show the proposed system with the local relative time attention mechanism achieves state-of-the-art transcription performance on both frame and note metrics (frame F1 93.40%, note with offset F1 88.50%).
ISSN:2640-0103
DOI:10.1109/APSIPAASC58517.2023.10317104