Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/45131

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 羅仁權(Ren C. Luo) | |
| dc.contributor.author | Chien-Chieh Huang | en |
| dc.contributor.author | 黃健桀 | zh_TW |
| dc.date.accessioned | 2021-06-15T04:05:42Z | - |
| dc.date.available | 2016-08-22 | |
| dc.date.copyright | 2011-08-22 | |
| dc.date.issued | 2011 | |
| dc.date.submitted | 2011-08-17 | |
| dc.identifier.citation | [1] P. Ekman and W. V. Friesen, “Facial Action Coding System: A Technique for the Measurement of Facial Movement,” Consulting Psychologists Press, Palo Alto, CA, 1978.
[2] MPEG-4 Systems, ISO/IEC N2201, 1998.
[3] M. Escher, I. Pandzic, and N. M. Thalmann, “Facial deformations for MPEG-4,” Proc. Computer Animation, pp. 56–62, PA, 1998.
[4] S. Kshirsagar, S. Garchery, and N. M. Thalmann, “Feature point based mesh deformation applied to MPEG-4 facial animation,” Proc. Deform, Workshop on Virtual Humans, pp. 23–34, 2000.
[5] K. Waters, “A muscle model for animating three-dimensional facial expression,” SIGGRAPH Comput. Graph., vol. 21, pp. 17–24, 1987.
[6] S. M. Platt and N. I. Badler, “Animating facial expression,” SIGGRAPH Comput. Graph., vol. 15, pp. 245–252, 1981.
[7] S. M. Platt, “A structural model of the human face,” Ph.D. dissertation, University of Pennsylvania, USA, 1985.
[8] K. Kähler, J. Haber, and H. P. Seidel, “Geometry-based muscle modeling for facial animation,” Proc. Graphics Interface, pp. 27–36, 2001.
[9] D. Terzopoulos and K. Waters, “Physically-based facial modeling, analysis, and animation,” J. Visual. Comput. Animation, vol. 1, pp. 73–80, 1990.
[10] Y. C. Lee, D. Terzopoulos, and K. Waters, “Constructing physics-based facial models of individuals,” Proc. Graphics Interface ’93, 1993.
[11] Y. C. Lee, D. Terzopoulos, and K. Waters, “Realistic modeling for facial animation,” Proc. Ann. Conf. Series, SIGGRAPH, pp. 55–62, 1995.
[12] C. J. Kuo, R. S. Huang, and T. G. Lin, “Synthesizing lateral face from frontal facial image using anthropometric estimation,” Proc. Image Process., vol. 1, pp. 133–136, 1997.
[13] D. DeCarlo, D. Metaxas, and M. Stone, “An anthropometric face model using variational techniques,” Proc. SIGGRAPH, pp. 67–74, 1998.
[14] S. J. Gortler and M. F. Cohen, “Hierarchical and variational geometric modeling with wavelets,” Symp. Interactive 3D Graphics, pp. 35–42, 1995.
[15] W. Welch and A. Witkin, “Variational surface modeling,” Proc. SIGGRAPH, vol. 26, pp. 157–166, 1992.
[16] F. Pighin, J. Hecker, D. Lischinski, R. Szeliski, and D. H. Salesin, “Synthesizing realistic facial expressions from photographs,” Proc. SIGGRAPH, pp. 75–84, Jul. 1998.
[17] F. Ulgen, “A step toward universal facial animation via volume morphing,” Proc. 6th IEEE Int. Workshop on Robot and Human Commun., pp. 358–363, 1997.
[18] B. Guenter, C. Grimm, D. Wood, H. Malvar, and F. Pighin, “Making faces,” Proc. SIGGRAPH ’98, pp. 55–66, Jul. 1998.
[19] M. J. D. Powell, “Radial basis functions for multivariate interpolation: A review,” in Algorithms for Approximation, J. C. Mason and M. G. Cox, Eds. Oxford, U.K.: Oxford University Press, pp. 143–167, 1987.
[20] T. Poggio and F. Girosi, “A theory of networks for approximation and learning,” A.I. Memo 1140, Mass. Inst. Tech., 1989.
[21] V. Blanz and T. Vetter, “A morphable model for the synthesis of 3D faces,” Proc. SIGGRAPH, ACM Press, pp. 187–194, 1999.
[22] P. Rubin, T. Baer, and P. Mermelstein, “An articulatory synthesizer for perceptual research,” J. Acoust. Soc. Am., pp. 321–328, 1981.
[23] J. Allen, M. S. Hunnicutt, and D. Klatt, From Text to Speech: The MITalk System. Cambridge, U.K.: Cambridge Univ. Press, 1987.
[24] J. P. H. van Santen, R. W. Sproat, J. P. Olive, and J. Hirschberg, Eds., Progress in Speech Synthesis. New York: Springer, 1997.
[25] J. P. H. van Santen, “Assignment of segmental duration in text-to-speech synthesis,” Comput. Speech and Language, pp. 95–128, 1994.
[26] L. F. Lamel, J. L. Gauvain, B. Prouts, C. Bouhier, and R. Boesch, “Generation and synthesis of broadcast messages,” Proc. ESCA-NATO Workshop on Applications of Speech Technology, 1993.
[27] A. W. Black, “Perfect synthesis for all of the people all of the time,” IEEE TTS Workshop, 2002.
[28] J. Kominek and A. W. Black, “CMU ARCTIC databases for speech synthesis,” Lang. Technol. Inst., Carnegie Mellon Univ., Pittsburgh, PA, 2003 [Online]. Available: http://festvox.org/cmu_arctic/index.html
[29] J. Zhang, “Language generation and speech synthesis in dialogues for language learning,” M.S. thesis, Dept. Elect. Eng., M.I.T., 2004.
[30] K. N. Stevens and A. S. House, “Speech perception (acoustic model and linguistic, syntactic, lexical and semantic factors in speech perception and production process),” in Foundations of Modern Auditory Theory, vol. 2, New York: Academic Press, pp. 3–62, 1972.
[31] H. Zen, T. Nose, J. Yamagishi, S. Sako, and K. Tokuda, “The HMM-based speech synthesis system (HTS) version 2.0,” Proc. 6th ISCA Workshop Speech Synth., pp. 294–299, Aug. 2007.
[32] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE, vol. 77, pp. 257–285, Feb. 1989.
[33] M. A. Przybocki and A. F. Martin, “NIST speaker recognition evaluation chronicles,” Proc. Odyssey 2004: Speaker Lang. Recognition Workshop, Toledo, Spain, pp. 15–22, Jun. 2004.
[34] Handbook of the International Phonetic Association, International Phonetic Association, Cambridge Univ. Press, 1999.
[35] T. Crowley, An Introduction to Historical Linguistics. Oxford Univ. Press, 1992.
[36] C. G. Fisher, “Confusions among visually perceived consonants,” J. Speech Hearing Res., vol. 11, pp. 796–803, 1968.
[37] T. Chen, “Audiovisual speech processing. Lip reading and lip synchronization,” IEEE Signal Processing Mag., vol. 18, pp. 9–21, Jan. 2001.
[38] H. McGurk and J. MacDonald, “Hearing lips and seeing voices,” Nature, vol. 264, pp. 746–748, 1976.
[39] E. Owens and B. Blazek, “Visemes observed by hearing-impaired and normal hearing adult viewers,” J. Speech Hearing Res., vol. 28, pp. 381–393, 1985. | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/45131 | - |
| dc.description.abstract | 在21世紀,智慧型機器人已是世界上相當重要的產業,有愈來愈多的機構在開發先進、多功能的智慧型機器人,例如輪型機器人、雙足機器人。隨著老年人口的增加與現今社會的經濟壓力,大多數的父母都需要外出工作;基於這個現象,我們做了一個給家中小孩與老人家們使用的應用。
人機互動介面在智慧型機器人領域一直是一個重要的技術,在此我們使用語音與人聲當做控制方式來與機器人溝通。這篇論文包含兩個主要的部份:一個是頭部模型的建立,另一個是語音處理。
語音與唇形的同步包含了電腦視覺、語音合成、語音辨識等技術。我們提出一個方法來達到語音與唇形的同步,利用微軟公司所開發的語音應用程式介面 (SAPI) 當做語音合成與辨識的工具。語音動畫包含兩個部份:語音與唇形畫面。語音合成的輸出來自文字轉語音 (TTS) 程式,而唇形畫面則由軟體 (FaceGen Modeller) 合成。
藉由輸入左側、右側、正面三張主要的照片,再經過校正點的校正,我們能夠得到一個與照片人物相近的3D人臉模型。使用C#撰寫的視素事件處理程序來連接唇形畫面與對應的視素 (viseme),並依照視素的排序關係匯入對應的唇形畫面。
目前語音合成的主要應用大多是當做輔助工具,例如視覺障礙人士的螢幕閱讀器,幫助他們閱讀;或者,無法說話的人可以藉由語音合成與其他人溝通。近幾年,語音合成被廣泛應用在服務型機器人與娛樂型產品,比如語言學習、教育、影音遊戲、動畫、音樂等方面。
最後,我們建立了一個快速的方法來產生3D頭部模型,並同時讓它與語音同步。這個應用可以用於教育小孩的英語閱讀與聽力;對於一些特定的人們,比如聾啞人士,也可以利用這個程式當做溝通的工具。 | zh_TW |
| dc.description.abstract | In the 21st century, intelligent robotics has become one of the most essential industries in the world, and a growing number of institutions are developing advanced, multi-functional intelligent robots of many types, such as wheeled robots and biped robots. With the growing elderly population and the economic pressure of present-day society, most parents both have to work outside the home. Because of this phenomenon, we built an application for the children and the elders at home.
Human-robot interaction (HRI) is an important technology in the intelligent robotics field, and here we use sound and voice as commands to communicate with robots. This thesis consists of two major parts: head modeling and speech processing. Synchronizing speech with mouth shapes involves technologies such as computer vision, speech synthesis, and speech recognition. We present a method to synchronize lip movement with speech, using Microsoft's Speech Application Programming Interface (SAPI) as the speech synthesis and recognition tool (illustrative sketches of the synthesis and recognition sides appear after this metadata table). Speech animation comprises two components: the speech and the images. The synthesized speech is produced by a Text-to-Speech (TTS) engine, and the viseme images are generated with the FaceGen Modeller software: three key photographs (front, left, and right views) are imported and calibrated against feature points to produce a person-specific 3D face model. A viseme event handler in C# connects each mouth-shape image with its corresponding viseme; the images are loaded in viseme order so that each viseme is matched with the correct image. The main applications of speech synthesis today are assistive devices, e.g., screen readers for people with visual impairment, and a mute person can use this technology to talk to others. In recent years, speech synthesis has also been applied extensively in service robotics and entertainment products such as language learning, education, video games, animation, and music videos. Finally, we establish a quick method to build a 3D head model and synchronize it with speech. This application can be used to teach children English reading and listening, and for specific groups, such as deaf and mute people, it can serve as a communication tool. | en |
| dc.description.provenance | Made available in DSpace on 2021-06-15T04:05:42Z (GMT). No. of bitstreams: 1 ntu-100-R97921047-1.pdf: 3122323 bytes, checksum: f8bef24bcb40a554952fe6d20059494e (MD5) Previous issue date: 2011 | en |
| dc.description.tableofcontents | Acknowledgements I
Chinese Abstract III
ABSTRACT IV
TABLE OF CONTENTS VI
LIST OF FIGURES VIII
LIST OF TABLES IX
CHAPTER 1 INTRODUCTION 1
1.1 ROBOT GENERATION 1
1.2 HUMAN-ROBOT INTERACTION 1
1.3 SPEECH ANIMATION 2
1.4 APPLICATIONS 4
1.5 ORGANIZATION 4
CHAPTER 2 PREVIOUS AND RELATED WORK 6
2.1 LIP SYNCHRONIZATION APPROACHES 6
2.2 FACIAL ACTION CODING SYSTEM (FACS) 6
2.3 MPEG-4 FACIAL ANIMATION 10
2.4 VISUAL SPEECH ANIMATION 13
CHAPTER 3 HUMAN HEAD MODELING 15
3.1 HEAD MODELING TECHNIQUES 15
3.1.1 Laser Scan Method 15
3.1.2 Photographic Method 16
3.2 PHYSICS-BASED MUSCLE MODELING 17
3.2.1 Vector Muscle 17
3.2.2 Spring Mesh Muscle 18
3.2.3 Layered Spring Mesh Muscle 18
3.3 3D FACE MODELING 19
3.3.1 Anthropometry 19
3.3.2 Person-Specific Model Creation 20
CHAPTER 4 SPEECH SIGNAL PROCESSING 22
4.1 SPEECH SYNTHESIS (SS) 22
4.1.1 Text-to-Speech (TTS) 23
4.1.2 Synthesizer Technology 24
4.1.3 Concatenative Synthesis 25
4.1.4 Formant Synthesis 27
4.1.5 Articulatory Synthesis 27
4.1.6 HMM-based Synthesis 28
4.2 SPEECH RECOGNITION (SR) 28
4.2.1 Algorithms 29
4.2.2 Hidden Markov Model 30
4.2.3 Performance Criterion 33
4.2.4 Applications 34
4.3 MICROSOFT SPEECH APPLICATION PROGRAMMING INTERFACE (SAPI) 35
4.3.1 Basic Architecture 37
4.3.2 SAPI Version 5 38
4.3.3 SAPI 5.1 and SAPI 5.3 39
CHAPTER 5 LIP SYNCHRONIZATION 40
5.1 PHONEME 40
5.2 COARTICULATION 41
5.3 VISEME 42
5.4 MCGURK EFFECT 43
5.5 PHONEMES AND VISEMES ASSIGNMENT 44
CHAPTER 6 SCENARIO APPLICATIONS 45
6.1 USUAL TALKS 45
6.2 A PRESCRIPTION 47
6.3 FEELING QUEASY 49
CHAPTER 7 RESULTS OF LIP-SYNC SPEECH ANIMATION 51
7.1 SPEECH SYNTHESIS AND SPEECH RECOGNITION 51
7.2 3D HEAD MODEL 54
7.3 LIP SYNCHRONIZATION ANIMATION 55
7.3.1 Facial Expressions 57
CHAPTER 8 CONCLUSIONS AND CONTRIBUTIONS 60
8.1 CONCLUSIONS 60
8.2 CONTRIBUTIONS 62
CHAPTER 9 FUTURE WORKS 64
REFERENCES 65
VITA 68 | |
| dc.language.iso | en | |
| dc.subject | 臉部動畫 | zh_TW |
| dc.subject | 唇形同步 | zh_TW |
| dc.subject | 語音辨識 | zh_TW |
| dc.subject | 語音合成 | zh_TW |
| dc.subject | 3D頭部模型 | zh_TW |
| dc.subject | facial animation | en |
| dc.subject | lip synchronization | en |
| dc.subject | speech recognition | en |
| dc.subject | speech synthesis | en |
| dc.subject | 3D head model | en |
| dc.title | 3D人臉動畫模型建立及唇形語音同步在人機互動系統之應用 | zh_TW |
| dc.title | 3D Facial Modeling and Animation with Speech / Lip Synchronization for Human-Robot Interactions | en |
| dc.type | Thesis | |
| dc.date.schoolyear | 99-2 | |
| dc.description.degree | 碩士 | |
| dc.contributor.oralexamcommittee | 馮蟻剛(I-Kong Fong),鄒杰烔(Jie-Tong Zou) | |
| dc.subject.keyword | 唇形同步,語音辨識,語音合成,3D頭部模型,臉部動畫 | zh_TW |
| dc.subject.keyword | lip synchronization, speech recognition, speech synthesis, 3D head model, facial animation | en |
| dc.relation.page | 69 | |
| dc.rights.note | 有償授權 | |
| dc.date.accepted | 2011-08-17 | |
| dc.contributor.author-college | 電機資訊學院 | zh_TW |
| dc.contributor.author-dept | 電機工程學研究所 | zh_TW |
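The abstract above describes a C# viseme event handler that pairs SAPI's text-to-speech output with pre-rendered FaceGen mouth-shape images. The thesis code itself is not part of this record, so the following is a minimal illustrative sketch of that idea using the managed `System.Speech` wrapper over SAPI; the sample utterance is arbitrary, and the console logging stands in for the image swap the thesis performs.

```csharp
// Illustrative sketch only (not the thesis's actual code): subscribe to
// SAPI viseme events and react to each mouth shape as it is spoken.
// Requires a reference to the System.Speech assembly (.NET Framework).
using System;
using System.Speech.Synthesis;

class LipSyncSketch
{
    static void Main()
    {
        using (var synth = new SpeechSynthesizer())
        {
            synth.SetOutputToDefaultAudioDevice();

            // SAPI defines 22 visemes: 0 is silence, 1-21 are mouth shapes.
            // A lip-sync application would display the pre-rendered FaceGen
            // image for e.Viseme here; this sketch just logs the event.
            synth.VisemeReached += (sender, e) =>
                Console.WriteLine("viseme {0} at {1} (duration {2})",
                                  e.Viseme, e.AudioPosition, e.Duration);

            synth.Speak("Hello, how are you today?");
        }
    }
}
```

Because `VisemeReached` reports the audio position and duration of each viseme, the image swap can be scheduled to stay in step with the synthesized audio, which is the synchronization property the abstract relies on.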
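The abstract also names SAPI as the recognition tool for voice commands to the robot. Again as a hedged sketch rather than the thesis's implementation, a small command vocabulary (the three words below are hypothetical placeholders, loosely echoing the thesis's dialogue scenarios) can be recognized with `System.Speech.Recognition`:

```csharp
// Illustrative sketch only (not the thesis's actual code): recognize a
// small set of spoken commands with the managed wrapper over SAPI.
using System;
using System.Speech.Recognition;

class CommandSketch
{
    static void Main()
    {
        using (var recognizer = new SpeechRecognitionEngine())
        {
            // Hypothetical command words; a real HRI system would load the
            // vocabulary of its own dialogue scenarios.
            var commands = new Choices("hello", "prescription", "goodbye");
            recognizer.LoadGrammar(new Grammar(new GrammarBuilder(commands)));

            recognizer.SetInputToDefaultAudioDevice();
            recognizer.SpeechRecognized += (sender, e) =>
                Console.WriteLine("heard: " + e.Result.Text);

            // Listen continuously until the user presses Enter.
            recognizer.RecognizeAsync(RecognizeMode.Multiple);
            Console.ReadLine();
        }
    }
}
```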
| Appears in Collections: | 電機工程學系 |
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-100-1.pdf (restricted access) | 3.05 MB | Adobe PDF |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.