Correlation Analysis Between Discrete Speech Representations and Phonemes

陳建成; Chien-cheng Chen

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96551

Title:	Correlation Analysis Between Discrete Speech Representations and Phonemes (語音離散表徵與音位的相關性分析)
Author:	Chien-cheng Chen (陳建成)
Advisor:	Lin-shan Lee (李琳山)
Keywords:	Speech Foundation Model, Discrete Unit, Speech Representation, Phonology, Correlation
Year of publication:	2025
Degree:	Master's
Abstract:	With recent advances in speech technology, powerful speech foundation models have been widely applied to a variety of speech tasks. Starting from the speech representations these models produce, discretization procedures such as clustering algorithms, together with large amounts of data and large models, have yielded a variety of discrete representations that approximate text. This has even given rise to the “Textless Natural Language Processing (NLP)” framework, which approximates text without using any actual text.
However, how closely these discrete speech representations correspond to human understanding of speech or text remains an open question. To answer it, this thesis draws on knowledge from phonology and takes as its reference the “phoneme,” the unit of human perception that is closest to text while remaining closely tied to the speech signal. We analyze two types of discrete speech representations: the first is the “discrete unit,” obtained through clustering algorithms, and the second is the “acoustic piece,” obtained by recombining discrete units with a tokenization algorithm. This thesis compares the correlation between phonemes and these discrete representations and investigates whether the representations can effectively identify pronunciation patterns close to those of human cognition.
Through the study of discrete units, we found that HuBERT is the most suitable model for obtaining discrete representations, and that increasing the number of clusters helps capture finer-grained speech features. The study of acoustic pieces then showed that they can serve as another effective way of discretizing speech representations, alongside clustering algorithms. In addition, from the perspective of phoneme categories, we observed that plosives and affricates are harder for discrete speech representations to classify accurately, whereas fricatives, diphthongs, and approximants are relatively easy for them to identify.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/96551
DOI:	10.6342/NTU202500258
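Full-text authorization:	Authorized (open access worldwide)
Full-text release date:	2025-02-20
Appears in collections:	Graduate Institute of Communication Engineering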

Files in this item:

File	Size	Format
ntu-113-1.pdf	9.64 MB	Adobe PDF
