Vision-to-Voice: AI for generating Description & Audio of Visual Content.

Bibliographic Details
Title: Vision-to-Voice: AI for generating Description & Audio of Visual Content.
Authors: Jayanth, P., Lakshmi Sree, K., Karthik Kumar Reddy, K., Om Prakash, G., Reddy Prasad, G.
Source: International Research Journal of Innovations in Engineering & Technology; 2025 Special Issue, Vol. 9, p206-213, 8p
Subject Terms: LANGUAGE models, SPEECH synthesis, COMPUTER vision, LINGUISTIC models, NATURAL languages
Abstract: The seamless transformation of visual content into descriptive text and naturalistic speech, termed Vision-to-Voice, represents a significant interdisciplinary advancement at the intersection of computer vision, natural language processing (NLP), and speech synthesis. This paper explores the development of an end-to-end Vision-to-Voice pipeline, encompassing visual scene understanding, semantic description generation, and high-quality speech synthesis, thereby enabling AI systems to narrate visual content for human users. The proposed methodology integrates Transformer-based image captioning models with context-aware linguistic augmentation and neural vocoders trained for expressive speech synthesis, ensuring fluent and expressive audio descriptions for visual content. While individual advancements in image captioning and text-to-speech (TTS) are well documented, their seamless fusion into an end-to-end, real-time system presents unique research and engineering challenges, including context preservation across modalities, maintaining linguistic fluency, and ensuring audio naturalness. This paper addresses these gaps through a unified encoder-decoder captioning module with Bahdanau attention, followed by a Tacotron 2-based mel-spectrogram generation module and a HiFi-GAN-based waveform synthesis module. Extensive experimentation and evaluations using standard datasets, including Flickr8K and LJSpeech, demonstrate the efficacy of the proposed system in terms of caption quality (BLEU) and audio naturalness (mean opinion score, MOS). The Vision-to-Voice system holds promising applications in assistive technologies, multimedia enrichment, and automated video annotation systems, thereby contributing to both academic research and real-world accessibility solutions. [ABSTRACT FROM AUTHOR]
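Illustrative note: the abstract names Bahdanau attention as the mechanism inside the captioning module. The following is a minimal sketch of additive (Bahdanau-style) attention in plain Python, not the authors' implementation. The function name `additive_attention`, the parameter names `W_q`, `W_k`, `v`, and the toy values are assumptions made for illustration only; the record does not specify dimensions or weights.

```python
import math

def additive_attention(query, keys, W_q, W_k, v):
    """Additive (Bahdanau-style) attention over a list of key vectors.

    query: decoder hidden state, a list of floats
    keys:  encoder features, one list of floats per image region
    W_q, W_k: projection matrices (lists of rows); v: scoring vector
    Returns (context, weights).
    """
    def matvec(matrix, vec):
        return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

    q_proj = matvec(W_q, query)
    scores = []
    for key in keys:
        k_proj = matvec(W_k, key)
        # score = v . tanh(W_q q + W_k k)
        hidden = [math.tanh(a + b) for a, b in zip(q_proj, k_proj)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))

    # Numerically stable softmax over the scores
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the keys
    dim = len(keys[0])
    context = [sum(w * k[j] for w, k in zip(weights, keys)) for j in range(dim)]
    return context, weights

# Toy example with hypothetical values (three "image regions", 2-d features)
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.5, -0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]
context, weights = additive_attention(query, keys, identity, identity, [1.0, 1.0])
```

In a captioning decoder of the kind the abstract describes, the context vector would typically be recomputed at every decoding step and combined with the previous word embedding to predict the next word.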
Copyright of International Research Journal of Innovations in Engineering & Technology is the property of International Research Journal of Innovations in Engineering & Technology (IRJIET) and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Complementary Index