Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97183

| Title: | 邁向趨近人類的表徵學習:語音的非監督式句法剖析與音訊影像表徵的泛用性探討 Towards Human-like Representation Learning: Unsupervised Syntax Parsing of Speech and General-Purpose Audio-Visual Representations |
| Authors: | 曾元 Yuan Tseng |
| Advisor: | 李琳山 Lin-shan Lee |
| Keyword: | self-supervised learning, constituency parsing, audio-visual learning |
| Publication Year: | 2024 |
| Degree: | Master's |
| Abstract: | The pretrain-then-finetune approach has been shown to be effective for speech processing, with successful results in speech recognition, speaker verification, and a wide variety of other speech-related tasks. Combined with self-supervised learning, the paradigm brings major advantages to speech technologies beyond improved task performance, including reduced dependency on large quantities of labeled data and simpler task-specific components. This suggests we are one step closer to constructing human-like models that can perform different multi-modal tasks by learning from vast amounts of unlabeled data plus a limited amount of labeled data. This thesis explores two directions towards that goal. First, an unsupervised spoken constituency parsing task is proposed to examine whether high-level linguistic structure, such as syntax, can be learned directly from speech without any paired data. Experiments show that while it remains difficult for machines to produce correct syntax trees from speech without any supervision, the model does show initial evidence of learning the branching direction of the language used for training. 
Second, existing self-supervised audio-visual learning frameworks are examined under a wider multi-modal, multi-task framework to determine how capable the existing approaches are on five speech and audio understanding tasks. For each model, three types of internal representations are obtained from auditory, visual, and both inputs, respectively. Next, model performance is measured by finetuning a small prediction head for each task, using each type of representation as input. The results of such a unified evaluation, covering five recently proposed models, show that no single model generalizes sufficiently to all tasks. By analyzing the applicability of self-supervised learning approaches to broader and more difficult tasks, this thesis aims to demonstrate the potential and shortcomings of existing technologies, in order to facilitate more research towards human-like audio-visual learning. |
| URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97183 |
| DOI: | 10.6342/NTU202500549 |
| Fulltext Rights: | Authorization granted (open access worldwide) |
| Embargo Lift Date: | 2025-02-28 |
| Appears in Collections: | Graduate Institute of Communication Engineering (電信工程學研究所) |
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-113-1.pdf | 4.18 MB | Adobe PDF | View/Open |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
