Improved Speaker Diarization Based on Speech Foundation Models

李高迪; Ko-Tik Lee

Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91760

Title:	Improved Speaker Diarization Based on Speech Foundation Models (original title: 基於語音基石模型之語者自動分段標記系統)
Authors:	李高迪 Ko-Tik Lee
Advisor:	李宏毅 Hung-Yi Lee
Keyword:	speaker diarization, speech foundation model
Publication Year :	2024
Degree:	Master's
Abstract:	Speaker diarization systems currently follow three main approaches: stage-wise, end-to-end, and hybrid end-to-end/stage-wise systems. End-to-end systems significantly outperform the others on certain datasets and have attracted wide attention. However, such systems may generalize poorly in real-world applications, and the potential of stage-wise systems may be underestimated. Meanwhile, recent speech foundation models have performed strongly across many speech tasks, suggesting broad applicability, yet their use in speaker diarization has not been explored in depth. This study therefore applies speech foundation models to tasks related to speaker diarization, compares their performance, and benchmarks them. It also proposes improvements that address problems in stage-wise systems, such as collar-aware speech onset detection and cluster purification, which significantly improve their performance. Finally, out-of-domain evaluation confirms the generalization problem of hybrid end-to-end/stage-wise systems, and improvements are proposed. The improved stage-wise and hybrid systems in this thesis achieve performance comparable or even superior to the state of the art on multiple datasets.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91760
DOI:	10.6342/NTU202400179
Fulltext Rights:	Authorization granted (access restricted to campus)
Appears in Collections:	Department of Electrical Engineering

Files in This Item:

File	Size	Format
ntu-112-1.pdf (access limited to NTU IP range)	1.52 MB	Adobe PDF
