Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96551
Full metadata record (DC field: value [language])

dc.contributor.advisor: 李琳山 [zh_TW]
dc.contributor.advisor: Lin-shan Lee [en]
dc.contributor.author: 陳建成 [zh_TW]
dc.contributor.author: Chien-cheng Chen [en]
dc.date.accessioned: 2025-02-19T16:29:01Z
dc.date.available: 2025-02-20
dc.date.copyright: 2025-02-19
dc.date.issued: 2025
dc.date.submitted: 2025-02-03
dc.identifier.citation:
D. M. Eberhard, G. F. Simons, and C. D. Fennig, Ethnologue: Languages of the World, 27th ed. Dallas, Texas: SIL International, 2024. [Online]. Available: https://www.ethnologue.com
J. Chorowski, R. J. Weiss, S. Bengio, and A. van den Oord, “Unsupervised speech representation learning using WaveNet autoencoders,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 12, pp. 2041–2053, 2019.
L.-W. Chen, S. Watanabe, and A. Rudnicky, “A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 11, pp. 12644–12652, Jun. 2023. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/26488
X. Zhao, Q. Zhu, J. Zhang, Y. Zhou, and P. Liu, “Speech Enhancement with Multi-granularity Vector Quantization,” in 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Oct. 2023, pp. 1937–1942. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/10317485
A. van den Oord, O. Vinyals et al., “Neural discrete representation learning,” Advances in Neural Information Processing Systems, vol. 30, 2017.
K. Lakhotia, E. Kharitonov, W.-N. Hsu, Y. Adi, A. Polyak, B. Bolte, T.-A. Nguyen, J. Copet, A. Baevski, A. Mohamed, and E. Dupoux, “On Generative Spoken Language Modeling from Raw Audio,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 1336–1354, 2021. Cambridge, MA: MIT Press. [Online]. Available: https://aclanthology.org/2021.tacl-1.79
A. van den Oord, Y. Li, and O. Vinyals, “Representation Learning with Contrastive Predictive Coding,” Jan. 2019, arXiv:1807.03748 [cs, stat]. [Online]. Available: http://arxiv.org/abs/1807.03748
W.-N. Hsu, Y.-H. H. Tsai, B. Bolte, R. Salakhutdinov, and A. Mohamed, “HuBERT: How Much Can a Bad Teacher Benefit ASR Pre-Training?” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Toronto, ON, Canada: IEEE, Jun. 2021, pp. 6533–6537. [Online]. Available: https://ieeexplore.ieee.org/document/9414460/
A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” in Advances in Neural Information Processing Systems, vol. 33. Curran Associates, Inc., 2020, pp. 12449–12460. [Online]. Available: https://proceedings.neurips.cc/paper/2020/hash/92d1e1eb1cd6f9fba3227870bb6d7f07-Abstract.html
P.-J. Chen, K. Tran, Y. Yang, J. Du, J. Kao, Y.-A. Chung, P. Tomasello, P.-A. Duquenne, H. Schwenk, H. Gong, H. Inaguma, S. Popuri, C. Wang, J. Pino, W.-N. Hsu, and A. Lee, “Speech-to-Speech Translation for a Real-world Unwritten Language,” in Findings of the Association for Computational Linguistics: ACL 2023, A. Rogers, J. Boyd-Graber, and N. Okazaki, Eds. Toronto, Canada: Association for Computational Linguistics, Jul. 2023, pp. 4969–4983. [Online]. Available: https://aclanthology.org/2023.findings-acl.307
S. Ren, S. Liu, Y. Wu, L. Zhou, and F. Wei, “Speech Pre-training with Acoustic Piece,” in Interspeech 2022. ISCA, Sep. 2022, pp. 2648–2652. [Online]. Available: https://www.isca-archive.org/interspeech_2022/ren22_interspeech.html
D. Wells, H. Tang, and K. Richmond, “Phonetic Analysis of Self-supervised Representations of English Speech,” in Interspeech 2022. ISCA, 2022, pp. 3583–3587. [Online]. Available: https://www.isca-archive.org/interspeech_2022/wells22_interspeech.html
W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
F. Wu, K. Kim, S. Watanabe, K. J. Han, R. McDonald, K. Q. Weinberger, and Y. Artzi, “Wav2Seq: Pre-Training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Rhodes Island, Greece: IEEE, Jun. 2023, pp. 1–5. [Online]. Available: https://ieeexplore.ieee.org/document/10096988/
X. Chang, B. Yan, K. Choi, J.-W. Jung, Y. Lu, S. Maiti, R. Sharma, J. Shi, J. Tian, S. Watanabe, Y. Fujita, T. Maekaku, P. Guo, Y.-F. Cheng, P. Denisov, K. Saijo, and H.-H. Wang, “Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2024, pp. 11481–11485. [Online]. Available: https://ieeexplore.ieee.org/document/10447929
W. S. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in nervous activity,” The Bulletin of Mathematical Biophysics, vol. 5, no. 4, pp. 115–133, Dec. 1943. [Online]. Available: https://doi.org/10.1007/BF02478259
F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in the brain,” Psychological Review, vol. 65, no. 6, pp. 386–408, 1958. [Online]. Available: https://doi.apa.org/doi/10.1037/h0042519
K.-I. Funahashi, “On the approximate realization of continuous mappings by neural networks,” Neural Networks, vol. 2, no. 3, pp. 183–192, Jan. 1989. [Online]. Available: https://www.sciencedirect.com/science/article/pii/0893608089900038
D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, Oct. 1986. [Online]. Available: https://www.nature.com/articles/323533a0
D. E. Rumelhart and J. L. McClelland, “Learning Internal Representations by Error Propagation,” in Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Foundations. MIT Press, 1987, pp. 318–362. [Online]. Available: https://ieeexplore.ieee.org/document/6302929
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/726791
D. H. Hubel and T. N. Wiesel, “Receptive fields of single neurones in the cat’s striate cortex,” The Journal of Physiology, vol. 148, no. 3, pp. 574–591, Oct. 1959. [Online]. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1363130/
S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder–decoder approaches,” in Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, D. Wu, M. Carpuat, X. Carreras, and E. M. Vecchi, Eds. Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 103–111. [Online]. Available: https://aclanthology.org/W14-4012
I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems, vol. 27, 2014.
D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021.
T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient Estimation of Word Representations in Vector Space,” Sep. 2013, arXiv:1301.3781 [cs]. [Online]. Available: http://arxiv.org/abs/1301.3781
M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep Contextualized Word Representations,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), M. Walker, H. Ji, and A. Stent, Eds. New Orleans, Louisiana: Association for Computational Linguistics, Jun. 2018, pp. 2227–2237. [Online]. Available: https://aclanthology.org/N18-1202
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds. Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 4171–4186. [Online]. Available: https://aclanthology.org/N19-1423
A. T. Liu, S.-w. Yang, P.-H. Chi, P.-c. Hsu, and H.-y. Lee, “Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders,” Oct. 2019. [Online]. Available: https://arxiv.org/abs/1910.12638v2
A. T. Liu, S.-W. Li, and H.-y. Lee, “TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, 2021. [Online]. Available: https://dl.acm.org/doi/10.1109/TASLP.2021.3095662
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” 2019.
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language Models are Few-Shot Learners,” in Advances in Neural Information Processing Systems, vol. 33. Curran Associates, Inc., 2020, pp. 1877–1901. [Online]. Available: https://papers.nips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
Y.-A. Chung and J. Glass, “Generative Pre-Training for Speech with Autoregressive Predictive Coding,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020, pp. 3497–3501.
M. Ravanelli, J. Zhong, S. Pascual, P. Swietojanski, J. Monteiro, J. Trmal, and Y. Bengio, “Multi-Task Self-Supervised Learning for Robust Speech Recognition,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020, pp. 6989–6993.
T. Maekaku, X. Chang, Y. Fujita, L.-W. Chen, S. Watanabe, and A. Rudnicky, “Speech representation learning combining Conformer CPC with deep cluster for the ZeroSpeech Challenge 2021,” 2022.
S. Schneider, A. Baevski, R. Collobert, and M. Auli, “wav2vec: Unsupervised Pre-Training for Speech Recognition,” in Proc. Interspeech 2019, 2019, pp. 3465–3469.
M. Rivière, A. Joulin, P.-E. Mazaré, and E. Dupoux, “Unsupervised pretraining transfers well across languages,” 2020.
A. Baevski, S. Schneider, and M. Auli, “vq-wav2vec: Self-supervised learning of discrete speech representations,” arXiv preprint arXiv:1910.05453, 2019.
“Textless NLP: Generating expressive speech from raw audio,” Sep. 2021. [Online]. Available: https://ai.meta.com/blog/textless-nlp-generating-expressive-speech-from-raw-audio/
G.-T. Lin, Y.-S. Chuang, H.-L. Chung, S.-w. Yang, H.-J. Chen, S. Dong, S.-W. Li, A. Mohamed, H.-y. Lee, and L.-s. Lee, “DUAL: Discrete spoken unit adaptive learning for textless spoken question answering,” arXiv preprint arXiv:2203.04911, 2022.
X. Zhang, D. Zhang, S. Li, Y. Zhou, and X. Qiu, “SpeechTokenizer: Unified speech tokenizer for speech large language models,” 2024.
K. Lakhotia, E. Kharitonov, W.-N. Hsu, Y. Adi, A. Polyak, B. Bolte, T.-A. Nguyen, J. Copet, A. Baevski, A. Mohamed, and E. Dupoux, “Generative Spoken Language Modeling from Raw Audio,” Sep. 2021, arXiv:2102.01192 [cs]. [Online]. Available: http://arxiv.org/abs/2102.01192
B. Shi, W.-N. Hsu, K. Lakhotia, and A. Mohamed, “Learning audio-visual speech representation by masked multimodal cluster prediction,” in International Conference on Learning Representations, 2021.
A. Sicherman and Y. Adi, “Analysing discrete self supervised speech representation for spoken language modeling,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), June 2023, pp. 1–5.
B. M. Abdullah, M. M. Shaik, B. Möbius, and D. Klakow, “An Information-Theoretic Analysis of Self-supervised Discrete Representations of Speech,” in Proc. INTERSPEECH 2023, 2023, pp. 2883–2887.
X. Chang, B. Yan, Y. Fujita, T. Maekaku, and S. Watanabe, “Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning,” in INTERSPEECH 2023. ISCA, Aug. 2023, pp. 1399–1403. [Online]. Available: https://www.isca-archive.org/interspeech_2023/chang23b_interspeech.html
A. H. Liu, H.-J. Chang, M. Auli, W.-N. Hsu, and J. Glass, “Dinosr: Self-distillation and online clustering for self-supervised speech representation learning,” Advances in Neural Information Processing Systems, vol. 36, 2024.
Z. Huang, C. Meng, and T. Ko, “RepCodec: A speech representation codec for speech tokenization,” arXiv preprint arXiv:2309.00169, 2023.
M. de Seyssel, M. Lavechin, Y. Adi, E. Dupoux, and G. Wisniewski, “Probing phoneme, language and speaker information in unsupervised speech representations,” in Proc. Interspeech 2022, 2022, pp. 1402–1406.
V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2015, pp. 5206–5210. [Online]. Available: https://ieeexplore.ieee.org/document/7178964
International Phonetic Association, Handbook of the International Phonetic Association: A guide to the use of the International Phonetic Alphabet. Cambridge University Press, 1999.
“The CMU Pronouncing Dictionary.” [Online]. Available: http://www.speech.cs.cmu.edu/cgi-bin/cmudict?stress=-s&in=CITE
A. Klautau, “ARPABET and the TIMIT alphabet,” 2001, archived file. [Online]. Available: https://web.archive.org/web/20160603180727/http://www.laps.ufpa.br/aldebaro/papers/ak_arpabet01.pdf (accessed Mar. 12, 2020).
Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey et al., “Google’s neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016.
T. Kudo and J. Richardson, “SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, E. Blanco and W. Lu, Eds. Brussels, Belgium: Association for Computational Linguistics, Nov. 2018, pp. 66–71. [Online]. Available: https://aclanthology.org/D18-2012
A. Elkahky, W.-N. Hsu, P. Tomasello, T.-A. Nguyen, R. Algayres, Y. Adi, J. Copet, E. Dupoux, and A. Mohamed, “Do coarser units benefit cluster prediction-based speech pre-training?” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), June 2023, pp. 1–5.
F. Shen, Y. Guo, C. Du, X. Chen, and K. Yu, “Acoustic BPE for speech generation with discrete tokens,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 11746–11750.
H.-J. Chang and J. Glass, “R-spin: Efficient speaker and noise-invariant representation learning with acoustic pieces,” arXiv preprint arXiv:2311.09117, 2023.
S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, “ESPnet: End-to-end speech processing toolkit,” in Proceedings of Interspeech, 2018, pp. 2207–2211. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-1456
P. Gage, “A new algorithm for data compression,” C Users Journal, vol. 12, no. 2, pp. 23–38, Feb. 1994.
R. Sennrich, B. Haddow, and A. Birch, “Neural Machine Translation of Rare Words with Subword Units,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), K. Erk and N. A. Smith, Eds. Berlin, Germany: Association for Computational Linguistics, Aug. 2016, pp. 1715–1725. [Online]. Available: https://aclanthology.org/P16-1162
T. Kudo, “Subword regularization: Improving neural network translation models with multiple subword candidates,” 2018.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96551
dc.description.abstract: With advances in speech technology, powerful speech foundation models have been widely applied to all kinds of speech tasks. Starting from the representations these models produce, discretization procedures such as clustering, backed by large amounts of data and models, have brought forth a variety of discrete representations that come close to text, and have even given rise to the “Textless Natural Language Processing (NLP)” framework, which approximates text without using any actual text.
However, how close these discrete speech representations are to human understanding of speech or text remains an open question. To answer it, this thesis draws on phonological knowledge and takes the phoneme, the humanly perceived unit that is closest to text and most tightly bound to the speech signal, as the baseline for analyzing two types of discrete speech representations: “discrete units,” obtained through clustering algorithms, and “acoustic pieces,” formed by regrouping discrete units with a tokenization algorithm. The thesis compares the correlation between phonemes and these discrete representations, and examines whether the representations can effectively identify pronunciation patterns that match human cognition.
From the study of discrete units, we find that HuBERT is the most suitable model for obtaining discrete representations, and that increasing the number of clusters helps capture finer-grained speech characteristics. The subsequent study of acoustic pieces shows that acoustic pieces offer another effective way, besides clustering algorithms, to discretize speech representations. Moreover, analyzing by phoneme type, we observe that plosive and affricate phonemes are hard for discrete speech representations to classify accurately, whereas the characteristics of fricatives, diphthongs, and approximants are comparatively easy for discrete representations to identify. [zh_TW]
dc.description.abstract: With recent advances in speech technology, powerful speech foundation models have been widely applied to a broad range of speech tasks. Through clustering algorithms and other discretization procedures applied to the representations these models produce, large amounts of data and models have made available various discrete representations that come close to text; a framework called “Textless Natural Language Processing,” which approximates text without using real text, has even emerged.
However, how closely these discrete speech representations correspond to human understanding of speech or text remains an open question. To answer it, this thesis combines knowledge of phonology and takes the phoneme, the human-perceived unit that is most text-like and most closely tied to the speech signal, as a reference. We analyze two types of discrete speech representations: “discrete units,” obtained through clustering algorithms, and “acoustic pieces,” formed by regrouping discrete units with tokenization algorithms. The thesis compares the correlation between phonemes and these discrete representations and investigates whether the representations can effectively identify pronunciation patterns close to human cognition.
Through the study of discrete units, we find that HuBERT is the most suitable model for obtaining discrete representations, and that increasing the number of clusters helps capture more subtle speech features. Through the study of acoustic pieces, we find that acoustic pieces serve as another effective method of discretizing speech representations besides clustering algorithms. In addition, from the perspective of phoneme types, we observe that plosives and affricates are difficult for discrete speech representations to classify accurately, while fricatives, diphthongs, and approximants are relatively easy for them to identify. [en]
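The abstracts describe a two-stage discretization pipeline: frame-level representations from a speech foundation model are clustered into discrete units, and the unit sequences are then regrouped into acoustic pieces with a tokenization algorithm. The Python sketch below is a minimal, hypothetical reconstruction of that pipeline, not the thesis's actual configuration: the HuBERT checkpoint, feature layer, cluster count, vocabulary size, and file paths are all illustrative assumptions.

```python
# Sketch of the discretization pipeline described in the abstract:
# (1) cluster HuBERT frame features into discrete units with k-means,
# (2) regroup unit sequences into acoustic pieces with SentencePiece BPE.
# Checkpoint, layer, cluster count, vocab size, and paths are assumptions.
import numpy as np
import torch
import torchaudio
import sentencepiece as spm
from sklearn.cluster import KMeans

bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

def frame_features(path: str, layer: int = 6) -> np.ndarray:
    """Return (frames, dim) features from one intermediate HuBERT layer."""
    wav, sr = torchaudio.load(path)
    wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)
    with torch.inference_mode():
        feats, _ = model.extract_features(wav, num_layers=layer)
    return feats[-1].squeeze(0).numpy()

# (1) Discrete units: fit k-means on pooled training frames.
train_files = ["a.wav", "b.wav"]  # hypothetical corpus
kmeans = KMeans(n_clusters=100, n_init=10).fit(
    np.concatenate([frame_features(p) for p in train_files]))

def to_units(path: str) -> list[int]:
    units = kmeans.predict(frame_features(path)).tolist()
    # Merge consecutive repeats, as is common before further modeling.
    return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]

# (2) Acoustic pieces: map each unit ID to a unique character so that BPE
# merges whole units, then learn a subword vocabulary over the unit "text".
def to_chars(units: list[int]) -> str:
    return "".join(chr(0x4E00 + u) for u in units)

with open("units.txt", "w", encoding="utf-8") as f:
    for p in train_files:
        f.write(to_chars(to_units(p)) + "\n")

spm.SentencePieceTrainer.train(
    input="units.txt", model_prefix="acoustic_piece",
    vocab_size=500, model_type="bpe", character_coverage=1.0)
sp = spm.SentencePieceProcessor(model_file="acoustic_piece.model")
pieces = sp.encode(to_chars(to_units("a.wav")), out_type=str)
```

Comparing either output against phonemes would additionally require frame-aligned phoneme labels; the sketch only covers producing the two kinds of discrete representations.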
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-02-19T16:29:01Z. No. of bitstreams: 0 [en]
dc.description.provenance: Made available in DSpace on 2025-02-19T16:29:01Z (GMT). No. of bitstreams: 0 [en]
dc.description.tableofcontents:
Thesis Committee Certification i
Acknowledgements ii
Chinese Abstract iv
English Abstract v
Chapter 1: Introduction 1
1.1 Motivation 1
1.2 Research Directions 3
1.3 Main Contributions 4
1.4 Thesis Organization 5
Chapter 2: Background 6
2.1 Deep Neural Networks 6
2.1.1 Overview 6
2.1.2 Convolutional Neural Networks 10
2.1.3 Recurrent Neural Networks and Sequence-to-Sequence Models 11
2.1.4 Attention Mechanisms and Transformer Networks 12
2.2 Representations and Self-Supervised Learning 16
2.2.1 Feature Extraction and Representation Learning 16
2.2.2 Self-Supervised Learning 17
2.2.3 Vector Quantization and Discrete Units 20
2.2.4 Textless Natural Language Processing Frameworks 20
2.3 Chapter Summary 21
Chapter 3: The Relation Between Single Discrete Speech Representations and Phonemes 22
3.1 Related Work 22
3.1.1 Textless NLP and Discrete Speech Representations 22
3.1.2 Phonological Analysis 23
3.2 Evaluation Metrics 23
3.2.1 Purity 26
3.2.2 Entropy and Mutual Information 27
3.3 Phoneme Types in Phonology 28
3.4 Datasets and Analyzed Models 29
3.5 Analysis Methods 33
3.6 Analysis Results 35
3.6.1 Overall Analysis 35
3.6.2 From the Perspective of Discrete Units 36
3.6.3 From the Perspective of Phonemes 46
3.6.4 Verification with Overall Heatmaps 53
3.7 Chapter Summary 53
Chapter 4: The Relation Between Multiple Discrete Speech Representations and Phonemes 55
4.1 Motivation 55
4.2 Related Work 55
4.3 Tokenization Algorithms in Text Processing 56
4.3.1 Common Algorithms 57
4.3.2 The SentencePiece Package 58
4.4 Analysis Methods 58
4.5 Analysis Results 59
4.5.1 From the Perspective of Acoustic Pieces 59
4.5.2 From the Perspective of Phonemes 72
4.5.3 Analysis Conclusions 76
4.6 Chapter Summary 76
Chapter 5: Conclusion and Future Work 78
5.1 Contributions and Discussion 78
5.2 Future Work 79
References 81
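Sections 3.2.1 and 3.2.2 in the table of contents above name purity, entropy, and mutual information as the metrics correlating discrete representations with phonemes. The sketch below is a rough illustration of how such phoneme-unit statistics are commonly computed, using textbook definitions and toy frame-aligned (phoneme, unit) pairs; the thesis's exact formulation may differ.

```python
# Cluster purity, phoneme purity, and phoneme-normalized mutual information
# (PNMI) from frame-aligned (phoneme, unit) pairs. The pairs are toy data;
# real ones would come from forced alignments and the clustering step.
import math
from collections import Counter

pairs = [("AH", 3), ("AH", 3), ("S", 7), ("S", 7), ("S", 2), ("IY", 3)]

joint = Counter(pairs)                    # counts of (phoneme, unit)
n = sum(joint.values())
phone = Counter(p for p, _ in pairs)      # marginal phoneme counts
unit = Counter(u for _, u in pairs)       # marginal unit counts

best_by_unit = Counter()
best_by_phone = Counter()
for (ph, u), c in joint.items():
    best_by_unit[u] = max(best_by_unit[u], c)
    best_by_phone[ph] = max(best_by_phone[ph], c)
# Cluster purity: each unit votes for its most frequent phoneme.
cluster_purity = sum(best_by_unit.values()) / n
# Phoneme purity: each phoneme votes for its most frequent unit.
phone_purity = sum(best_by_phone.values()) / n

# Mutual information I(phoneme; unit), normalized by the phoneme entropy
# H(phoneme); this ratio is sometimes called PNMI.
mi = sum(c / n * math.log((c / n) / ((phone[ph] / n) * (unit[u] / n)))
         for (ph, u), c in joint.items())
h_phone = -sum(c / n * math.log(c / n) for c in phone.values())
pnmi = mi / h_phone

print(f"cluster purity={cluster_purity:.3f}, "
      f"phoneme purity={phone_purity:.3f}, PNMI={pnmi:.3f}")
```

Higher purity and PNMI indicate that units align more consistently with phoneme identities, which is the sense in which the thesis asks whether discrete representations recover human-perceived pronunciation patterns.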
dc.language.iso: zh_TW
dc.subject: 語音表徵 (Speech Representation) [zh_TW]
dc.subject: 相關性 (Correlation) [zh_TW]
dc.subject: 語音基石模型 (Speech Foundation Model) [zh_TW]
dc.subject: 離散單元 (Discrete Unit) [zh_TW]
dc.subject: 語音學 (Phonology) [zh_TW]
dc.subject: Correlation [en]
dc.subject: Speech Foundation Model [en]
dc.subject: Discrete Unit [en]
dc.subject: Speech Representation [en]
dc.subject: Phonology [en]
dc.title: 語音離散表徵與音位的相關性分析 [zh_TW]
dc.title: Correlation Analysis Between Discrete Speech Representations and Phonemes [en]
dc.type: Thesis
dc.date.schoolyear: 113-1
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 曹昱;李宏毅 [zh_TW]
dc.contributor.oralexamcommittee: Yu Tsao;Hung-yi Lee [en]
dc.subject.keyword: 語音基石模型, 離散單元, 語音表徵, 語音學, 相關性 [zh_TW]
dc.subject.keyword: Speech Foundation Model, Discrete Unit, Speech Representation, Phonology, Correlation [en]
dc.relation.page: 92
dc.identifier.doi: 10.6342/NTU202500258
dc.rights.note: 同意授權(全球公開) (Authorized, open access worldwide)
dc.date.accepted: 2025-02-04
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 電信工程學研究所 (Graduate Institute of Communication Engineering)
dc.date.embargo-lift: 2025-02-20
Appears in Collections: 電信工程學研究所 (Graduate Institute of Communication Engineering)

Files in This Item:
ntu-113-1.pdf (9.64 MB, Adobe PDF)