MyLipper: A Personalized System for Speech Reconstruction using Multi-view Visual Feeds



Bibliographic Details
Published in: 2018 IEEE International Symposium on Multimedia (ISM), pp. 159 - 166
Main Authors: Kumar, Yaman; Jain, Rohit; Salik, Mohd; Shah, Rajiv Ratn; Zimmermann, Roger; Yifang Yin
Format: Conference Proceeding
Language: English
Published: IEEE, 01.12.2018
Subjects:
Online Access: Full Text
Abstract Lipreading is the task of looking at, perceiving, and interpreting spoken symbols. It has a wide range of applications, such as surveillance, Internet telephony, speech reconstruction for silent movies, and aiding people with speech or hearing impairments. However, most of the work in the lipreading literature has been limited to classifying speech videos into text classes formed of phrases, words, and sentences. Even this has relied on a highly constrained lexicon of words, which in turn restricts the total number of classes (i.e., phrases, words, and sentences) considered for the classification task. Recently, research has ventured into generating speech (audio) from silent video sequences. Although non-frontal views have shown the potential to enhance the performance of speech reading and reconstruction systems, there have been no developments in using multiple camera feeds for this purpose. To this end, this paper presents a multi-view speech reading and reconstruction system. The major contribution of this paper is MyLipper, a vocabulary- and language-agnostic, real-time model that handles a variety of speaker poses. The model leverages silent video feeds from multiple cameras recording a subject to generate intelligible speech for that speaker, making it a personalized speech reconstruction model. It uses a deep-learning-based STCNN+BiGRU architecture to achieve this goal. The results obtained with MyLipper show an improvement of over 20% in the intelligibility of the reconstructed speech (as measured by PESQ) when using multiple views compared to a single-view visual feed. This confirms the importance of exploiting multiple views in building an efficient speech reconstruction system. The paper further shows the optimal placement of cameras for maximum speech intelligibility. Finally, we demonstrate the reconstructed audio from MyLipper overlaid on the corresponding videos for a variety of videos from the dataset.
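
The record describes MyLipper's architecture only at a high level: a spatiotemporal CNN (STCNN) front-end followed by a bidirectional GRU, consuming several camera views of a speaker and regressing to audio features. As a rough illustration of that idea only, the PyTorch sketch below encodes each view with 3D convolutions, concatenates the per-frame features across views, and runs a BiGRU that predicts per-frame audio features. All layer sizes, the number of views, the mouth-crop resolution, and the audio-feature dimension (an LPC-style parameterization is suggested by the subject terms, but not confirmed here) are assumptions, not details taken from the paper.

import torch
import torch.nn as nn

class MyLipperSketch(nn.Module):
    """Illustrative multi-view STCNN+BiGRU sketch; all hyperparameters are assumed."""
    def __init__(self, num_views=2, audio_feat_dim=26, hidden=256):
        super().__init__()
        # One spatiotemporal CNN (3D convolutions) per camera view.
        self.view_encoders = nn.ModuleList([
            nn.Sequential(
                nn.Conv3d(3, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
                nn.ReLU(),
                nn.MaxPool3d(kernel_size=(1, 2, 2)),
                nn.Conv3d(32, 64, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
                nn.ReLU(),
                nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep the time axis, pool space away
            )
            for _ in range(num_views)
        ])
        # Bidirectional GRU over the fused per-frame features from all views.
        self.bigru = nn.GRU(64 * num_views, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        # Per-frame regression to audio features (e.g. LPC-style coefficients).
        self.head = nn.Linear(2 * hidden, audio_feat_dim)

    def forward(self, views):
        # views: list of tensors, one per camera, each (batch, 3, time, height, width)
        feats = []
        for enc, view in zip(self.view_encoders, views):
            f = enc(view)                                             # (batch, 64, time, 1, 1)
            feats.append(f.squeeze(-1).squeeze(-1).transpose(1, 2))   # (batch, time, 64)
        fused = torch.cat(feats, dim=-1)   # concatenate views along the feature axis
        out, _ = self.bigru(fused)         # (batch, time, 2 * hidden)
        return self.head(out)              # (batch, time, audio_feat_dim)

# Example: two 75-frame views of a 50x100 mouth crop -> per-frame audio features.
model = MyLipperSketch(num_views=2)
views = [torch.randn(1, 3, 75, 50, 100) for _ in range(2)]
print(model(views).shape)  # torch.Size([1, 75, 26])

Concatenating per-frame view features before the recurrent layer is only one plausible fusion strategy; the paper may fuse views differently, and a complete system would also need an audio-feature-to-waveform synthesis step that this sketch omits.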
Author Kumar, Yaman
Jain, Rohit
Salik, Mohd
Shah, Rajiv Ratn
Yifang Yin
Zimmermann, Roger
Author_xml – sequence: 1
  givenname: Yaman
  surname: Kumar
  fullname: Kumar, Yaman
  email: ykumar@adobe.com
– sequence: 2
  givenname: Rohit
  surname: Jain
  fullname: Jain, Rohit
  email: rohitj.co@nsit.net.in
  organization: MIDAS Lab., Univ. of Delhi, Delhi, India
– sequence: 3
  givenname: Mohd
  surname: Salik
  fullname: Salik, Mohd
  email: khwajam.co@nsit.net.in
  organization: Indraprastha Inst. of Inf. Technol., New Delhi, India
– sequence: 4
  givenname: Rajiv Ratn
  surname: Shah
  fullname: Shah, Rajiv Ratn
  email: rajivratn@iiitd.ac.in
  organization: Indraprastha Inst. of Inf. Technol., New Delhi, India
– sequence: 5
  givenname: Roger
  surname: Zimmermann
  fullname: Zimmermann, Roger
  email: rogerz@comp.nus.edu.sg
  organization: Nat. Univ. of Singapore, Singapore, Singapore
– sequence: 6
  surname: Yifang Yin
  fullname: Yifang Yin
  email: yifang@comp.nus.edu.sg
  organization: Nat. Univ. of Singapore, Singapore, Singapore
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/ISM.2018.00-19
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
EISBN 9781538668573
1538668572
EndPage 166
ExternalDocumentID 8603277
Genre orig-research
ISICitedReferencesCount 13
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
PageCount 8
ParticipantIDs ieee_primary_8603277
PublicationCentury 2000
PublicationDate 2018-Dec
PublicationDateYYYYMMDD 2018-12-01
PublicationDate_xml – month: 12
  year: 2018
  text: 2018-Dec
PublicationDecade 2010
PublicationTitle 2018 IEEE International Symposium on Multimedia (ISM)
PublicationTitleAbbrev ISM
PublicationYear 2018
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssj0002684108
SourceID ieee
SourceType Publisher
StartPage 159
SubjectTerms Cameras
Feature extraction
Feeds
Hidden Markov models
Linear predictive coding
Multi-layer neural network
Signal reconstruction
Speech synthesis
Task analysis
Videos
Visualization
Title MyLipper: A Personalized System for Speech Reconstruction using Multi-view Visual Feeds
URI https://ieeexplore.ieee.org/document/8603277