Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97183

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 李琳山 | zh_TW |
| dc.contributor.advisor | Lin-shan Lee | en |
| dc.contributor.author | 曾元 | zh_TW |
| dc.contributor.author | Yuan Tseng | en |
| dc.date.accessioned | 2025-02-27T16:34:18Z | - |
| dc.date.available | 2025-02-28 | - |
| dc.date.copyright | 2025-02-27 | - |
| dc.date.issued | 2024 | - |
| dc.date.submitted | 2025-02-12 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97183 | - |
| dc.description.abstract | 「預訓練再微調」(pretrain-then-finetune)這一套訓練方法套用在語音辨識、語者驗證等不同語音處理任務中,都被證實有不錯的效果。在與自監督式學習(self-supervised learning)結合後,這套方法除了顯著地提升效能以外,也為語音科技帶來其他重要的效益,包括減少模型對標註資料的需求,以及簡化不同任務間模型的架構差異。這也顯示不久的未來有機會實現能夠從大量無標註資料與一些標註資料學習,並有能力同時處理多任務、多模態的一個通用模型,讓語音科技向實現人類等級模型的目標更進一步。本論文中探究分析了延伸模型能力之深度與廣度的兩個方向的嘗試:首先,本論文提出一非監督式語音句法剖析任務,以探討在沒有成對資料的情況下,能否直接從語音得到一段語句之句法結構。實驗顯示在缺少成對資料的情況下,從口述語句得到正確的句法剖析樹極為困難。即便如此,模型仍展現出具備初步的判斷訓練資料的語言之分支結構的能力之跡象。其次,本論文在多模態、多任務的更大框架下比較現有的自監督式訓練架構,檢驗現有的訓練架構是否能夠泛用在各種語音及音訊處理任務上。對一影音輸入,一個模型共可以取得音訊、影像、及混合三種內部表徵。接著以單一表徵作為輸入,對每一語音及音訊處理任務去訓練一個小模型,探討模型表徵的泛用性。在評估五個近期提出的模型後,結果顯示並沒有任一單一模型可以適用在所有任務上。透過研究範圍更大、難度更高的任務,本論文希望探索現有自監督式表徵學習的一些可能性與局限性,並希望朝向更像人類能力之通用模型的目標前進。 | zh_TW |
| dc.description.abstract | The pretrain-then-finetune approach has been shown to be an effective direction for speech processing, with successful results in speech recognition, speaker verification, and a wide variety of other speech-related tasks. Combined with self-supervised learning, the paradigm brings major attractive advantages to speech technologies in addition to improved task performance, including reducing the dependency on large quantities of labeled data and simplifying the task-specific components. This implies that we are one step closer to constructing human-like models, able to perform different multi-modal tasks by learning from vast amounts of unlabeled data plus some limited labeled data. This thesis focuses on two different directions towards the above goal: First, the unsupervised spoken constituency parsing task is proposed to examine the possibility of learning high-level linguistic structural information, such as syntax, directly from speech without any paired data. Experiments show that while it is still difficult at this moment for machines to learn to produce correct syntax trees from speech without any supervision, the model does indicate some initial evidence of being able to learn the branching direction of the language used for training. Second, existing self-supervised audio-visual learning frameworks are broadly examined under a wider multi-modal, multi-task framework to determine how capable the existing approaches are on five speech and audio understanding tasks. For each model, three types of internal representations are obtained from auditory, visual, and both inputs, respectively. Next, model performance is measured by finetuning a small prediction head for each task, using each type of representation as input. The results of such a unified evaluation show that no single model can sufficiently generalize to all tasks. By analyzing the applicability of self-supervised learning approaches to more difficult and broader tasks, this thesis aims to demonstrate the potential and shortcomings of existing technologies, in order to facilitate more research towards human-like audio-visual learning. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-02-27T16:34:18Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-02-27T16:34:18Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | 口試委員審定書 i
摘要 iii
Abstract v
目次 vii
圖次 xi
表次 xiii
第一章 導論 1
1.1 研究動機 1
1.2 研究方向 2
1.3 研究貢獻 3
1.4 章節安排 3
第二章 背景知識 5
2.1 監督式學習 5
2.2 深層類神經網路(Deep Neural Networks) 7
2.2.1 簡介 7
2.2.2 全連接類層 7
2.2.3 卷積層 8
2.2.4 遞迴層 8
2.2.5 轉換器層 10
2.3 自監督式學習(Self-Supervised Learning, SSL) 11
2.3.1 自監督式語音模型 11
2.3.2 音訊-影像自監督式學習 14
2.4 成分句法剖析(Constituency Parsing) 15
2.4.1 問題簡介 15
2.4.2 非監督式成分句法剖析(Unsupervised Constituency Parsing) 15
第三章 探討自監督式模型能力之深度——語音非監督式成分句法剖析(Unsupervised Spoken Constituency Parsing) 19
3.1 實驗動機 19
3.2 問題定義與正確性衡量 20
3.3 本章採用的剖析器架構 21
3.4 實驗方法 24
3.4.1 串接式系統:以語音辨識轉寫結果作為句法剖析器輸入 24
3.4.2 直接式系統:以語音表徵作為成分句法剖析器輸入 25
3.5 實驗設定 26
3.6 實驗結果 27
3.6.1 串接式系統結果 27
3.6.2 直接式系統結果 28
3.6.3 直接式系統之分支方向 29
3.7 本章結論 30
第四章 探討自監督式模型能力之廣度——模型於音訊-影像任務之效用評比 31
4.1 實驗動機 31
4.2 實驗設定 32
4.2.1 所評量之自監督式表徵模型 33
4.2.2 任務及資料集簡介 33
4.3 實驗結果與討論 36
4.3.1 五表徵模型之評量結果 36
4.3.2 逐層貢獻度分析 38
4.3.3 監督式訓練對表徵泛用性之影響 41
4.3.4 本章結論 42
第五章 結論與展望 43
5.1 研究總結 43
5.2 未來展望 44
參考文獻 45 | - |
| dc.language.iso | zh_TW | - |
| dc.subject | 音頻-影像學習 | zh_TW |
| dc.subject | 成分句法剖析 | zh_TW |
| dc.subject | 自監督式學習 | zh_TW |
| dc.subject | audio-visual learning | en |
| dc.subject | self-supervised learning | en |
| dc.subject | constituency parsing | en |
| dc.title | 邁向趨近人類的表徵學習:語音的非監督式句法剖析與音訊影像表徵的泛用性探討 | zh_TW |
| dc.title | Towards Human-like Representation Learning: Unsupervised Syntax Parsing of Speech and General-Purpose Audio-Visual Representations | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-1 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 曹昱;陳尚澤;賴穎暉;王新民;李宏毅 | zh_TW |
| dc.contributor.oralexamcommittee | Yu Tsao;Shang-Tse Chen;Ying-Hui Lai;Hsin-Min Wang;Hung-yi Lee | en |
| dc.subject.keyword | 自監督式學習,成分句法剖析,音頻-影像學習, | zh_TW |
| dc.subject.keyword | self-supervised learning,constituency parsing,audio-visual learning, | en |
| dc.relation.page | 51 | - |
| dc.identifier.doi | 10.6342/NTU202500549 | - |
| dc.rights.note | 同意授權(全球公開) | - |
| dc.date.accepted | 2025-02-12 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 電信工程學研究所 | - |
| dc.date.embargo-lift | 2025-02-28 | - |
Appears in Collections: 電信工程學研究所
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-113-1.pdf | 4.18 MB | Adobe PDF |
Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.