Audio-to-Deep-Lip: Speaking lip synthesis based on 3D landmarks
| Published in: | Computers & graphics, Volume 120, p. 103925 |
|---|---|
| Main authors: | Fang, Hui (Beijing Institute of Technology); Weng, Dongdong (Beijing Institute of Technology); Tian, Zeyu (Beijing Institute of Technology); Ma, Yin (Ningxia Baofeng Group Co. LTD.); Lu, Xiangju (iQIYI Inc.) |
| Format: | Journal Article |
| Language: | English |
| Publication details: | Elsevier Ltd, 01.05.2024 |
| Subject: | 3D talking meshes; Landmarks; Lip animation; Viseme |
| ISSN: | 0097-8493 |
| DOI: | 10.1016/j.cag.2024.103925 |
| Online access: | Get full text |
Abstract:

Generating talking lips in sync with input speech has the potential to enhance speech communication and enable novel applications. This paper presents a system that generates accurate 3D talking lips and is readily applicable to unseen subjects and different languages. The developed head-mounted facial acquisition device and automated data processing pipeline produce precise landmarks while mitigating the difficulty of acquiring 3D facial data. Our system generates accurate lip movements in three stages. In the first stage, a fine-tuned Wav2Vec2.0+Transformer captures long-range audio context dependencies. In the second stage, we propose the Viseme Fixing method, which significantly improves lip accuracy at the /b/, /p/, /m/, and /f/ phonemes. In the last stage, we exploit the structural relationship between the inner and outer lips and learn to map the outer lip landmarks to the inner lip landmarks. Subjective evaluations show that the generated talking lips closely match the input audio. We demonstrate two applications that animate 2D face videos and 3D face models using our landmarks; the precise lip landmarks allow the generated animations to exceed the results of state-of-the-art methods.

Highlights:

- Generate accurate 3D talking lips for unseen subjects and different languages.
- Develop a head-mounted facial acquisition device to acquire precise 3D facial data.
- Propose the Viseme Fixing algorithm to enhance the lip accuracy of the /b/, /p/, /m/, and /f/ visemes.
- Build two applications that bridge 3D landmarks to talking videos and talking meshes.
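This record describes the three-stage pipeline but does not include its implementation. As an illustration only, the following is a minimal sketch of a stage-1 style model: a pretrained Wav2Vec2.0 encoder followed by a Transformer that captures long-range audio context and regresses per-frame 3D lip landmarks. The checkpoint name, layer sizes, and landmark count (20 points) are assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class AudioToLipLandmarks(nn.Module):
    def __init__(self, n_landmarks: int = 20, d_model: int = 256):
        super().__init__()
        # Pretrained speech encoder; the paper fine-tunes Wav2Vec2.0
        # (checkpoint choice here is an assumption).
        self.wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        self.proj = nn.Linear(self.wav2vec.config.hidden_size, d_model)
        # Transformer encoder to capture long-range audio context dependencies.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.context = nn.TransformerEncoder(layer, num_layers=4)
        # Regress (x, y, z) for every lip landmark at every audio frame.
        self.head = nn.Linear(d_model, n_landmarks * 3)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples), 16 kHz mono audio
        feats = self.wav2vec(waveform).last_hidden_state   # (B, T, hidden)
        ctx = self.context(self.proj(feats))               # (B, T, d_model)
        out = self.head(ctx)                               # (B, T, n_landmarks * 3)
        return out.view(out.shape[0], out.shape[1], -1, 3)

model = AudioToLipLandmarks()
lips = model(torch.randn(1, 16000))   # one second of audio
print(lips.shape)                     # torch.Size([1, 49, 20, 3]): ~50 landmark frames/s
```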
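The second stage, Viseme Fixing, improves lip accuracy at the bilabial/labiodental phonemes, where the lips must visibly close. The sketch below shows one plausible form of such a fix, assuming per-frame phoneme labels from a forced aligner (e.g., the Montreal Forced Aligner with the CMU pronouncing dictionary, both cited by the paper); the landmark indexing, upper/lower pairing, and blending window are illustrative, not the paper's algorithm.

```python
import numpy as np

CLOSURE_PHONES = {"B", "P", "M", "F"}  # ARPAbet symbols, as in the CMU dictionary

def viseme_fix(landmarks, frame_phones, upper_idx, lower_idx, blend=2):
    """Pull paired upper/lower lip landmarks together at closure phonemes.

    landmarks: (T, N, 3) per-frame 3D lip landmarks
    frame_phones: length-T phoneme label per frame
    upper_idx, lower_idx: paired index lists of equal length
    """
    fixed = landmarks.copy()
    weights = np.zeros(len(frame_phones))
    weights[[p in CLOSURE_PHONES for p in frame_phones]] = 1.0
    # Triangular kernel so the closure eases in and out over `blend` frames.
    kernel = np.concatenate([np.linspace(0, 1, blend + 1),
                             np.linspace(1, 0, blend + 1)[1:]])
    weights = np.clip(np.convolve(weights, kernel, mode="same"), 0.0, 1.0)
    for t, w in enumerate(weights):
        if w == 0:
            continue
        # Blend each upper/lower pair toward their midpoint to close the mouth.
        mid = 0.5 * (fixed[t, upper_idx] + fixed[t, lower_idx])
        fixed[t, upper_idx] = (1 - w) * fixed[t, upper_idx] + w * mid
        fixed[t, lower_idx] = (1 - w) * fixed[t, lower_idx] + w * mid
    return fixed
```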
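The third stage learns a mapping from outer-lip landmarks to inner-lip landmarks using their structural relationship. A small regression network is one natural realisation; the sketch below assumes a layout of 12 outer and 8 inner points and a two-hidden-layer MLP, which are illustrative choices rather than the paper's architecture.

```python
import torch
import torch.nn as nn

# Maps flattened outer-lip coordinates to inner-lip coordinates.
outer_to_inner = nn.Sequential(
    nn.Linear(12 * 3, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 8 * 3),
)

def predict_inner(outer: torch.Tensor) -> torch.Tensor:
    # outer: (batch, 12, 3) outer-lip landmarks -> (batch, 8, 3) inner-lip landmarks
    return outer_to_inner(outer.flatten(1)).view(-1, 8, 3)

# Training would minimise an L2 loss against captured inner-lip landmarks:
# loss = ((predict_inner(outer) - inner_gt) ** 2).mean()
```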