MyLipper: A Personalized System for Speech Reconstruction using Multi-view Visual Feeds
Saved in:
| Published in: | 2018 IEEE International Symposium on Multimedia (ISM), pp. 159-166 |
|---|---|
| Main authors: | Kumar, Yaman; Jain, Rohit; Salik, Mohd; Shah, Rajiv Ratn; Yin, Yifang; Zimmermann, Roger |
| Format: | Conference Proceeding |
| Language: | English |
| Published: | IEEE, 01.12.2018 |
| Subjects: | |
| Online access: | Full text |
| Abstract | Lipreading is the task of looking at, perceiving, and interpreting spoken symbols. It has a wide range of applications, such as surveillance, Internet telephony, speech reconstruction for silent movies, and aids for people with speech and hearing impairments. However, most work in the lipreading literature has been limited to classifying speech videos into text classes formed of phrases, words, and sentences. Even this has been based on a highly constrained lexicon of words, which in turn restricts the total number of classes (i.e., phrases, words, and sentences) considered for the classification task. Recently, research has ventured into generating speech (audio) from silent video sequences. Although non-frontal views have shown the potential to enhance the performance of speech reading and reconstruction systems, there have been no developments in using multiple camera feeds for this purpose. To this end, this paper presents a multi-view speech reading and reconstruction system. The major contribution of this paper is a model, namely MyLipper, which is vocabulary- and language-agnostic, runs in real time, and deals with a variety of speaker poses. The model leverages silent video feeds from multiple cameras recording a subject to generate intelligible speech for that speaker, thus being a personalized speech reconstruction model. It uses a deep-learning-based STCNN+BiGRU architecture to achieve this goal. The results obtained using MyLipper show an improvement of over 20% in the reconstructed speech's intelligibility (as measured by PESQ) when using multiple views compared to a single-view visual feed. This confirms the importance of exploiting multiple views in building an efficient speech reconstruction system. The paper further shows the optimal placement of cameras that leads to the maximum intelligibility of speech. Further, we demonstrate the reconstructed audios overlaid on the corresponding videos obtained from MyLipper using a variety of videos from the dataset. |
|---|---|
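The abstract describes a pipeline in which spatiotemporal convolutions (STCNN) extract per-frame visual features from each camera view, the views are fused, and a bidirectional GRU turns the fused sequence into per-frame speech features. As a rough illustration only (the paper's actual architecture, layer sizes, and training procedure are not reproduced here; every function name, dimension, and weight below is made up for the sketch), that data flow can be mimicked in numpy:

```python
import numpy as np

def stconv3d(video, kernel):
    """Valid spatiotemporal (3D) convolution over a single-channel clip.

    video:  (T, H, W) array -- grayscale frames from one camera view
    kernel: (kt, kh, kw) array -- one spatiotemporal filter
    returns (T-kt+1, H-kh+1, W-kw+1) feature volume
    """
    T, H, W = video.shape
    kt, kh, kw = kernel.shape
    out = np.empty((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = np.sum(video[t:t + kt, i:i + kh, j:j + kw] * kernel)
    return out

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU cell update (biases omitted for brevity)."""
    z = sigmoid(Wz @ x + Uz @ h)               # update gate
    r = sigmoid(Wr @ x + Ur @ h)               # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))   # candidate state
    return (1 - z) * h + z * h_tilde

def bigru(x_seq, params, hidden):
    """Bidirectional GRU; a real BiGRU uses separate weights per direction --
    the same params are reused here purely to keep the sketch short."""
    def run(seq):
        h = np.zeros(hidden)
        outs = []
        for x in seq:
            h = gru_step(x, h, *params)
            outs.append(h.copy())
        return np.stack(outs)
    fwd = run(x_seq)
    bwd = run(x_seq[::-1])[::-1]
    return np.concatenate([fwd, bwd], axis=1)  # (T', 2*hidden)

def encode_views(views, kernel, params, hidden=8):
    """STCNN features per view -> concatenation (multi-view fusion) -> BiGRU."""
    feats = [stconv3d(v, kernel) for v in views]
    feats = [f.reshape(f.shape[0], -1) for f in feats]
    x_seq = np.concatenate(feats, axis=1)      # fuse views along the feature axis
    return bigru(x_seq, params, hidden)        # per-frame speech features
```

The concatenation in `encode_views` is one simple way to combine camera feeds; the resulting per-frame vectors would then be decoded into audio features (e.g., LPC coefficients, which the record lists among its subject terms) by further trained layers not shown here.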
| Author | Kumar, Yaman; Jain, Rohit; Salik, Mohd; Shah, Rajiv Ratn; Yin, Yifang; Zimmermann, Roger |
| Author_xml | 1. Yaman Kumar (ykumar@adobe.com); 2. Rohit Jain (rohitj.co@nsit.net.in), MIDAS Lab., Univ. of Delhi, Delhi, India; 3. Mohd Salik (khwajam.co@nsit.net.in), Indraprastha Inst. of Inf. Technol., New Delhi, India; 4. Rajiv Ratn Shah (rajivratn@iiitd.ac.in), Indraprastha Inst. of Inf. Technol., New Delhi, India; 5. Roger Zimmermann (rogerz@comp.nus.edu.sg), Nat. Univ. of Singapore, Singapore; 6. Yifang Yin (yifang@comp.nus.edu.sg), Nat. Univ. of Singapore, Singapore |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding |
| DOI | 10.1109/ISM.2018.00-19 |
| EISBN | 9781538668573; 1538668572 |
| EndPage | 166 |
| ExternalDocumentID | 8603277 |
| Genre | orig-research |
| ISICitedReferencesCount | 13 |
| Language | English |
| PageCount | 8 |
| PublicationDate | 2018-Dec |
| PublicationTitle | 2018 IEEE International Symposium on Multimedia (ISM) |
| PublicationTitleAbbrev | ISM |
| PublicationYear | 2018 |
| Publisher | IEEE |
| StartPage | 159 |
| SubjectTerms | Cameras; Feature extraction; Feeds; Hidden Markov models; Linear predictive coding; Multi-layer neural network; Signal reconstruction; Speech synthesis; Task analysis; Videos; Visualization |
| Title | MyLipper: A Personalized System for Speech Reconstruction using Multi-view Visual Feeds |
| URI | https://ieeexplore.ieee.org/document/8603277 |