Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98860

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 張瑞益 | zh_TW |
| dc.contributor.advisor | Ray-I Chang | en |
| dc.contributor.author | 鄭新曄 | zh_TW |
| dc.contributor.author | Hsin-Yeh Cheng | en |
| dc.date.accessioned | 2025-08-19T16:28:48Z | - |
| dc.date.available | 2025-08-20 | - |
| dc.date.copyright | 2025-08-19 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-08-12 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/98860 | - |
| dc.description.abstract | 由於台灣原住民族語屬於稀少性語言、資源有限,導致語言保存與教學難以推行。近年來,生成式人工智慧應用日新月異,也廣泛運用於教學領域,因此本論文針對台灣泰雅族(Atayal)的主要方言賽考利克泰雅族語(Squliq Atayal)開發文字轉語音(Text-to-speech, TTS)模型,結合口型同步(lip synchronization)技術,以生成式人工智慧建立口說教學影片生成系統,降低其教學影片準備之人力與時間成本。近年來針對常見語言所設計的TTS模型,其結果雖已接近一般人的說話品質,然而語音模型多受限於語料品質,對於像台灣原住民族語這類稀少性語言,因語料十分稀缺,其 TTS 模型之發音流暢性與正確率不佳。因此我們提出多階段遷移學習(multi-stage transfer learning),使用大量英文語音資料預先訓練的 Tacotron 2 模型,再以同為南島語系、資料量較大的菲律賓他加祿語(Tagalog)進行第一階段遷移學習,最後使用現有少量泰雅族語料完成第二階段遷移學習,結合泰雅族語微調聲碼器(vocoder)。實驗結果發現,相較於直接遷移學習,我們提出的多階段遷移學習方法可在僅有少量台灣原住民語料下,大幅提昇其 TTS 模型之發音流暢性與正確率,以5分制的平均意見分數(mean opinion score, MOS)評估,整體評分提升0.26分、發音正確性更提升0.39分。整合所提出之技術,我們於 ComfyUI 平台建置一個生成式人工智慧之教學影片生成系統,在影片的口型同步部份使用 MuseTalk,將講者影像和TTS模型發音之音檔整合,生成嘴型與語音自然流暢的台灣原住民族語口說教學影片。本論文所提出之多階段遷移學習方法顯示可以少量語料建立稀少性語言TTS 模型,並以實例呈現導入生成式人工智慧之教學影片生成系統於教育應用的效果與潛力。 | zh_TW |
| dc.description.abstract | Taiwan's Indigenous languages are classified as low-resource and endangered, posing significant challenges to both language preservation and instructional development. With the rapid advancement of generative artificial intelligence (GenAI), educational applications have gained increasing attention and investment. This study develops a text-to-speech (TTS) model for Squliq Atayal, the primary dialect of the Atayal language in Taiwan, and integrates lip synchronization techniques to construct a GenAI-driven video generation system for spoken language instruction. The goal is to reduce the labor and time involved in producing teaching materials. While mainstream TTS models for high-resource languages such as English have achieved near-human speech quality, their performance heavily depends on large and high-quality speech corpora. For low-resource languages like those spoken by Taiwan's Indigenous communities, limited audio datasets often result in poor fluency and pronunciation accuracy. To mitigate these limitations, we proposed a multi-stage transfer learning framework: we first used a Tacotron 2 model pre-trained on a large-scale English corpus, then applied intermediate transfer learning using Tagalog, a related Austronesian language with more abundant data, before conducting a final fine-tuning stage using the available Atayal data. The vocoder was also fine-tuned using Atayal speech. Experimental results showed that, compared to single-stage transfer learning, the proposed approach significantly improved both fluency and pronunciation accuracy, even with limited Indigenous language data. Evaluated by a 5-point Mean Opinion Score (MOS), the proposed model achieved a 0.26 improvement in overall quality and a 0.39 increase in pronunciation accuracy. By integrating the proposed methods, we further implemented a GenAI-driven video generation system on the ComfyUI platform, using MuseTalk for lip synchronization to generate oral-language instructional videos for Taiwan's Indigenous languages in which the speaker's mouth movements are naturally aligned with the synthesized speech. These findings demonstrate that the proposed multi-stage transfer learning framework can build TTS models for low-resource languages using minimal data and highlight the potential of deploying GenAI-driven video generation systems in educational settings. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-08-19T16:28:48Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-08-19T16:28:48Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | 口試委員會審定書 i 誌謝 ii 中文摘要 iii ABSTRACT iv CONTENTS vi LIST OF FIGURES ix LIST OF TABLES x Chapter 1 Introduction 1 1.1 Motivation 1 1.2 Problem Statement 2 1.3 Objectives and Contributions 3 1.4 Outline 4 Chapter 2 Background Knowledge 5 2.1 Transfer Learning 5 2.2 Mel Spectrogram 6 2.3 Tacotron 2 7 2.3.1 Encoder 8 2.3.2 Attention Mechanism 9 2.3.3 Decoder 10 2.3.4 Post-net 10 2.4 Vocoder 11 2.4.1 Griffin-Lim Algorithm 11 2.4.2 WaveNet 12 2.4.3 WaveGlow 12 2.5 Lip Synchronization 14 2.6 Video Generation 15 Chapter 3 Proposed Methods 18 3.1 System Structure 18 3.2 Multi-Stage Transfer Learning 19 3.2.1 Pretraining and Transfer Learning Techniques 19 3.2.2 Multi-Stage Transfer Learning Strategy 20 3.3 Dataset 21 3.3.1 Cross-Linguistic Comparison of Tagalog and Atayal 21 3.3.2 Speech Data for Training 23 3.4 WaveGlow Adaptation to Atayal Speech 24 Chapter 4 Experimental Setup 25 4.1 Data Preprocessing 25 4.1.1 Audio Preprocessing 25 4.1.2 Text Preprocessing 26 4.2 Model Pretraining and Transfer Learning 27 4.2.1 Transfer Learning Setup and Hyperparameter Settings 27 4.2.2 WaveGlow Vocoder Adaptation Settings 27 4.3 System Implementation on ComfyUI 28 4.4 Subjective Evaluation in Mean Opinion Score 29 4.5 Objective Evaluation Metrics 30 Chapter 5 Results and Discussion 33 5.1 System Execution and Performance 33 5.1.1 System Interface and Functional Nodes 33 5.1.2 Computational Efficiency in Generating Videos 34 5.2 Audio Output Evaluation 36 5.2.1 Subjective Evaluation 36 5.2.2 Objective Evaluation 37 5.3 TTS Model Evaluation 38 5.3.1 Results and Selection of Hyperparameter Settings 38 5.3.2 Multi-Stage Transfer Learning Results 40 Chapter 6 Conclusion and Future Work 42 6.1 Summary of Contributions 42 6.2 Future Work 43 REFERENCES 45 | - |
| dc.language.iso | en | - |
| dc.subject | Tacotron 2 | zh_TW |
| dc.subject | Multi-stage transfer learning | zh_TW |
| dc.subject | Text-to-speech | zh_TW |
| dc.subject | Lip synchronization | zh_TW |
| dc.subject | MuseTalk | zh_TW |
| dc.subject | Tacotron 2 | en |
| dc.subject | MuseTalk | en |
| dc.subject | Lip synchronization | en |
| dc.subject | Multi-stage transfer learning | en |
| dc.subject | Text-to-speech | en |
| dc.title | 以生成式人工智慧建立稀少性語言之影片生成系統 | zh_TW |
| dc.title | A GenAI-Based Video Generation System for Low-Resource Languages | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-2 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 林正偉;陳彥廷 | zh_TW |
| dc.contributor.oralexamcommittee | Jeng-Wei Lin;Yen-Ting Chen | en |
| dc.subject.keyword | Text-to-speech,Multi-stage transfer learning,Tacotron 2,Lip synchronization,MuseTalk, | zh_TW |
| dc.subject.keyword | Text-to-speech,Multi-stage transfer learning,Tacotron 2,Lip synchronization,MuseTalk, | en |
| dc.relation.page | 48 | - |
| dc.identifier.doi | 10.6342/NTU202504213 | - |
| dc.rights.note | 同意授權(全球公開) | - |
| dc.date.accepted | 2025-08-14 | - |
| dc.contributor.author-college | 工學院 | - |
| dc.contributor.author-dept | 工程科學及海洋工程學系 | - |
| dc.date.embargo-lift | 2025-08-20 | - |
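The English abstract above outlines the thesis's multi-stage transfer learning recipe: start from a Tacotron 2 model pre-trained on a large English corpus, run an intermediate transfer stage on the related, higher-resource Tagalog, then fine-tune on the small Atayal corpus, with the WaveGlow vocoder also adapted to Atayal speech. The sketch below illustrates only this staging idea in PyTorch; the toy model, checkpoint path, datasets, learning rates, and epoch counts are illustrative assumptions, not the thesis's actual Tacotron 2 configuration.

```python
"""Minimal sketch of multi-stage transfer learning: warm-start from an
English-pretrained checkpoint, fine-tune on a related higher-resource
language (Tagalog), then fine-tune again on the low-resource target
(Atayal). All names and numbers are placeholders for illustration."""
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def fine_tune(model, dataset, lr, epochs):
    """Run one transfer-learning stage on the given dataset."""
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
    return model


# Tiny placeholder network standing in for the Tacotron 2 acoustic model.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))

# Stage 0: warm start from an English-pretrained checkpoint (hypothetical path).
# state = torch.load("tacotron2_english_pretrained.pt")
# model.load_state_dict(state, strict=False)  # strict=False tolerates mismatched layers

# Toy stand-ins for the Tagalog and Atayal (text, acoustic-target) pairs.
tagalog = TensorDataset(torch.randn(64, 16), torch.randn(64, 16))
atayal = TensorDataset(torch.randn(16, 16), torch.randn(16, 16))

# Stage 1: intermediate transfer on the related, higher-resource language.
fine_tune(model, tagalog, lr=1e-3, epochs=5)

# Stage 2: final adaptation on the scarce target-language data, with a
# smaller learning rate to avoid overwriting what stage 1 learned.
fine_tune(model, atayal, lr=1e-4, epochs=5)

torch.save(model.state_dict(), "atayal_tts_sketch.pt")
```

Lowering the learning rate in the final stage is a common way to adapt to scarce target-language data without erasing the representations acquired in earlier stages.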
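The subjective evaluation cited in the abstract uses a 5-point Mean Opinion Score (MOS), with the proposed model gaining 0.26 points overall and 0.39 points in pronunciation accuracy. As a small illustration of how such scores are typically aggregated, the sketch below averages listener ratings and attaches a 95% confidence interval; the rating lists are invented example values, not the study's listener data.

```python
"""Illustrative aggregation of 5-point MOS listener ratings with a 95% CI."""
import math
import statistics


def mos(ratings):
    """Return (mean, 95% CI half-width) for a list of 1-5 listener ratings."""
    mean = statistics.mean(ratings)
    if len(ratings) > 1:
        ci = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    else:
        ci = 0.0
    return mean, ci


baseline = [3, 4, 3, 3, 4, 3, 2, 4]  # single-stage transfer (example values)
proposed = [4, 4, 3, 4, 4, 4, 3, 4]  # multi-stage transfer (example values)

for name, scores in [("baseline", baseline), ("proposed", proposed)]:
    m, ci = mos(scores)
    print(f"{name}: MOS = {m:.2f} +/- {ci:.2f}")
```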
Appears in Collections: 工程科學及海洋工程學系
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-113-2.pdf | 5.5 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
