Audio Retrieval With Natural Language Queries: A Benchmark Study

Bibliographic Details
Published in:	IEEE Transactions on Multimedia, Vol. 25; pp. 2675-2685
Main Authors:	Koepke, A. Sophia; Oncescu, Andreea-Maria; Henriques, Joao F.; Akata, Zeynep; Albanie, Samuel
Format:	Journal Article
Language:	English
Published:	Piscataway: IEEE, 2023; The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects:	Audio retrieval; Benchmark testing; Benchmarks; datasets; Descriptions; Free form; Grounding; Metadata; Natural language; Natural languages; Queries; Retrieval; Task analysis; text-based retrieval; Visual databases; Visualization
ISSN:	1520-9210, 1941-0077

Description
Summary:	The objectives of this work are cross-modal text-audio and audio-text retrieval, in which the goal is to retrieve the audio content from a pool of candidates that best matches a given written description, and vice versa. Text-audio retrieval enables users to search large databases through an intuitive interface: they simply issue free-form natural language descriptions of the sound they would like to hear. To study the tasks of text-audio and audio-text retrieval, which have received limited attention in the existing literature, we introduce three challenging new benchmarks. We first construct text-audio and audio-text retrieval benchmarks from the AudioCaps and Clotho audio captioning datasets. Additionally, we introduce the SoundDescs benchmark, which consists of paired audio and natural language descriptions for a diverse collection of sounds that are complementary to those found in AudioCaps and Clotho. We employ these three benchmarks to establish baselines for cross-modal text-audio and audio-text retrieval, where we demonstrate the benefits of pre-training on diverse audio tasks. We hope that our benchmarks will inspire further research into audio retrieval with free-form text queries.
DOI:	10.1109/TMM.2022.3149712
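
The retrieval task described in the summary (picking, from a pool of candidates, the audio clip that best matches a written description, and vice versa) can be illustrated with a minimal sketch. It assumes that text and audio have already been mapped to vectors in a shared embedding space and that candidates are ranked by cosine similarity; neither detail is stated in this record. The toy vectors, the function names, and the Recall@K metric are hypothetical choices made for illustration, not the authors' method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(query, candidates):
    """Return candidate indices sorted from best to worst match."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


def recall_at_k(queries, candidates, k):
    """Fraction of queries whose paired candidate (same index) ranks in the top k."""
    hits = sum(
        1 for i, q in enumerate(queries)
        if i in rank_candidates(q, candidates)[:k]
    )
    return hits / len(queries)


# Toy embeddings (assumption): text_emb[i] is the description paired with audio_emb[i].
text_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_emb = [[0.9, 0.1], [0.8, 0.6], [0.1, 1.0]]

# Text-audio retrieval: each description queries the pool of audio clips.
text_to_audio_r1 = recall_at_k(text_emb, audio_emb, k=1)

# Audio-text retrieval: the same routine with the two roles swapped.
audio_to_text_r1 = recall_at_k(audio_emb, text_emb, k=1)
```

Swapping the arguments is all that is needed to move between the two retrieval directions, since the ranking routine makes no assumption about which modality a vector came from.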