Audio-to-Deep-Lip: Speaking lip synthesis based on 3D landmarks

Detailed bibliography

Published in: Computers & Graphics, Volume 120, Article 103925
Main authors: Fang, Hui; Weng, Dongdong; Tian, Zeyu; Ma, Yin; Lu, Xiangju
Format: Journal Article
Language: English
Published: Elsevier Ltd, 1 May 2024
Subjects: Landmarks; 3D talking meshes; Viseme; Lip animation
ISSN: 0097-8493
DOI: 10.1016/j.cag.2024.103925
Copyright: 2024 Elsevier Ltd
Abstract

Generating talking lips in sync with input speech has the potential to enhance speech communication and enable novel applications. This paper presents a system that can generate accurate 3D talking lips and is readily applicable to unseen subjects and different languages. The developed head-mounted facial acquisition device and automated data processing pipeline can generate precise landmarks while mitigating the difficulty of acquiring 3D facial data. Our system consists of three stages to generate accurate lip movements. In the first stage, the fine-tuned Wav2Vec2.0+Transformer captures long-range audio context dependencies. In the second stage, we propose the Viseme Fixing method, which significantly improves lip accuracy at the /b/, /p/, /m/, and /f/ phonemes. In the last stage, we exploit the structural relationship between the inner and outer lips and learn to map the outer lip landmarks to the inner lip landmarks. Subjective evaluations show that the generated talking lips closely match the input audio. We demonstrate two applications that animate 2D face videos and 3D face models using our landmarks. The precise lip landmarks allow the generated animations to exceed the results of state-of-the-art methods.

Highlights:
• Generate accurate 3D talking lips for unseen subjects and different languages.
• Develop a head-mounted facial acquisition device to acquire precise 3D facial data.
• Propose the Viseme Fixing algorithm to enhance lip accuracy at the /b/, /p/, /m/, and /f/ visemes.
• Build two applications that bridge 3D landmarks to talking videos and talking meshes.
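The first stage pairs a fine-tuned Wav2Vec2.0 encoder with a Transformer to capture long-range audio context. The record contains no code, so the following is only a minimal sketch of that kind of audio-to-landmark pipeline; the checkpoint name, the 20-point lip layout, and all hyperparameters are assumptions made for the sketch, not details taken from the paper.

    # Minimal sketch, not the authors' code: wav2vec 2.0 features feeding a
    # Transformer that regresses per-frame 3D lip landmarks.
    import torch
    import torch.nn as nn
    from transformers import Wav2Vec2Model

    class AudioToLipLandmarks(nn.Module):
        def __init__(self, n_landmarks=20, d_model=768, n_layers=4):
            super().__init__()
            # Pretrained speech encoder (the paper fine-tunes Wav2Vec2.0).
            self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
            layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
            # Transformer layers model long-range context across audio frames.
            self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)
            self.head = nn.Linear(d_model, n_landmarks * 3)  # x, y, z per landmark

        def forward(self, waveform):  # waveform: (batch, samples) at 16 kHz
            feats = self.encoder(waveform).last_hidden_state  # (batch, frames, 768)
            ctx = self.temporal(feats)
            return self.head(ctx).unflatten(-1, (-1, 3))      # (batch, frames, 20, 3)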
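Stage two, Viseme Fixing, is described only at the level of "improves lip accuracy at the /b/, /p/, /m/, /f/ phonemes". One plausible reading, sketched below, is to take phoneme intervals from a forced aligner (the paper cites the Montreal Forced Aligner, which produces such intervals) and push the predicted lips to a closed pose on those frames; the interval format, frame rate, and hard snapping are all assumptions, and the paper's actual correction may be more subtle.

    # Hedged sketch of a viseme-fixing pass: snap frames aligned to the
    # bilabial/labiodental phonemes /b/ /p/ /m/ /f/ to a closed-lip pose.
    import numpy as np

    CLOSURE_PHONES = {"b", "p", "m", "f"}

    def viseme_fix(landmarks, phone_intervals, fps=30.0, closed_pose=None):
        """landmarks: (frames, points, 3); phone_intervals: [(phone, start_s, end_s)]."""
        fixed = landmarks.copy()
        if closed_pose is None:
            # Stand-in for a measured closed-lip shape from the training data.
            closed_pose = landmarks.mean(axis=0)
        for phone, start, end in phone_intervals:
            if phone.lower() in CLOSURE_PHONES:
                lo, hi = int(start * fps), int(np.ceil(end * fps))
                fixed[lo:hi] = closed_pose  # enforce lip closure on these frames
        return fixed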
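Stage three learns a mapping from outer-lip landmarks to inner-lip landmarks, exploiting their structural coupling. A small regressor is enough to illustrate the shape of that mapping; the 12-outer/8-inner split below follows the common 68-point facial landmark convention and is an assumption, as is the MLP architecture.

    # Sketch of an outer-to-inner lip landmark regressor (assumed sizes).
    import torch
    import torch.nn as nn

    class OuterToInnerLip(nn.Module):
        def __init__(self, n_outer=12, n_inner=8, hidden=256):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(n_outer * 3, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, n_inner * 3),
            )

        def forward(self, outer):  # outer: (batch, n_outer, 3)
            flat = outer.flatten(start_dim=1)
            return self.mlp(flat).unflatten(-1, (-1, 3))  # (batch, n_inner, 3)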
Authors and affiliations:
1. Hui Fang (ORCID 0009-0002-3505-3308): Beijing Engineering Research Center of Mixed Reality and Advanced Display, School of Optics and Photonics, Beijing Institute of Technology, No. 5 Zhongguancun South Street, Beijing, 100081, China
2. Dongdong Weng (crgj@bit.edu.cn): same affiliation as Hui Fang
3. Zeyu Tian: same affiliation as Hui Fang
4. Yin Ma: Ningxia Baofeng Group Co. LTD., No. 19, International Trade City, Yinchuan, Ningxia, 750003, China
5. Xiangju Lu: iQIYI Inc., Hongcheng Building, No. 2 North First Street, Haidian, Beijing, 100027, China