Classification and study of music genres with multimodal Spectro-Lyrical Embeddings for Music (SLEM)

Bibliographic details
Published in:	Multimedia Tools and Applications, Vol. 84, No. 7, pp. 3701-3721
Main authors:	Mehra, Ashman; Mehra, Aryan; Narang, Pratik
Format:	Journal Article
Language:	English
Published:	New York: Springer US, 01.02.2025; Springer Nature B.V.
Subjects:	Algorithms; Classification; Clustering; Computer Communication Networks; Computer Science; Data Structures and Information Theory; Deep learning; Genre; Machine learning; Multimedia Information Systems; Music; Special Purpose and Application-Based Systems; Spectrograms; Representation learning; Multimodal music embeddings; Music Spectrograms
ISSN:	1380-7501, 1573-7721

Description
Summary:	The essence of music is inherently multi-modal, with audio and lyrics going hand in hand. However, very little research has been done on the intricacies of the multi-modal nature of music and its relation to genres. Our work uses this multi-modality to present spectro-lyrical embeddings for music representation (SLEM), leveraging open-source, lightweight, state-of-the-art deep learning vision and language models to encode songs. This work summarises extensive experimentation with over 20 deep learning-based music embeddings of a self-curated and hand-labeled multi-lingual dataset of 226 recent songs spread over 5 genres. Our aim is to study how varying the weight of lyrics and spectrograms in the embeddings affects multi-class genre classification. The purpose of this study is to prove that a simple linear combination of both modalities is better than either modality alone. Our methods achieve accuracies ranging from 81.08% to 98.60% for different genres, using the K-nearest neighbors algorithm on the multimodal embeddings. We successfully study the intricacies of genres in this representational space, including their misclassification, visual clustering with EM-GMM, and the domain-specific meaning of the multi-modal weight for each genre with respect to 'instrumentalness' and 'energy' metadata. SLEM presents one of the first works on an end-to-end method that uses spectro-lyrical embeddings without hand-engineered features.
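Illustration:	The summary mentions a weighted linear combination of lyric and spectrogram embeddings classified with K-nearest neighbors. The sketch below is not the authors' implementation; it only shows one plausible reading of that idea. The weighted concatenation, the weight name `alpha`, the helper names, the toy 2-dimensional vectors, and the placeholder genre labels are all assumptions made for illustration.

```python
import math
from collections import Counter


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def combine(spec_emb, lyric_emb, alpha):
    """Weighted concatenation of the two modalities.

    ASSUMPTION: alpha weights the lyrics and (1 - alpha) the spectrogram.
    The abstract does not specify the exact combination rule.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    spec = [(1.0 - alpha) * x for x in l2_normalize(spec_emb)]
    lyr = [alpha * x for x in l2_normalize(lyric_emb)]
    return spec + lyr


def knn_predict(train_vecs, train_labels, query, k=3):
    """Majority vote among the k nearest training vectors (Euclidean distance)."""
    dists = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_vecs, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Toy, made-up embeddings: (spectrogram vector, lyric vector, placeholder genre).
songs = [
    ([1.0, 0.0], [0.0, 1.0], "genre_a"),
    ([0.9, 0.1], [0.1, 0.9], "genre_a"),
    ([0.0, 1.0], [1.0, 0.0], "genre_b"),
    ([0.1, 0.9], [0.9, 0.1], "genre_b"),
]

alpha = 0.5  # equal weight on both modalities
train = [combine(spec, lyr, alpha) for spec, lyr, _ in songs]
labels = [genre for _, _, genre in songs]

query = combine([0.95, 0.05], [0.05, 0.95], alpha)
predicted = knn_predict(train, labels, query, k=3)
print(predicted)
```

Sweeping `alpha` from 0 (spectrogram only) to 1 (lyrics only) and re-running the classifier is one simple way to explore how the modality weight affects genre predictions under these assumptions.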
DOI:	10.1007/s11042-024-19160-5