NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91760
Full metadata record

DC Field: Value (Language)
dc.contributor.advisor: 李宏毅 (zh_TW)
dc.contributor.advisor: Hung-Yi Lee (en)
dc.contributor.author: 李高迪 (zh_TW)
dc.contributor.author: Ko-Tik Lee (en)
dc.date.accessioned: 2024-02-22T16:35:59Z
dc.date.available: 2024-02-23
dc.date.copyright: 2024-02-22
dc.date.issued: 2024
dc.date.submitted: 2024-01-31
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91760
dc.description.abstract: 當前，語者自動分段標記系統 (speaker diarization) 主要運用三種方法：階段性、端到端和端到端-階段性混合系統。端到端系統在某些資料集上顯著優於其他方法，引起廣泛關注。然而，這種系統可能在實際應用中面臨泛用性限制，而階段性系統的潛力可能被低估。與此同時，近期的語音基石模型 (speech foundation model) 在多項語音任務中表現出色，顯示出其廣泛應用的潛力。然而，在語者自動分段標記方面，對其應用尚未深入探討。因此，本研究旨在將語音基石模型應用於語者自動分段標記相關任務，進行性能比較並進行表現基準化。同時，針對階段性系統存在的問題，提出了改進方法，例如具緩衝區意識的話語開始點偵測和聚類純化，顯著提升了其性能。最後，透過域外評估方法，證實了端到端-階段性混合系統的泛用性問題，並提出了改進方法。本論文改進後的階段性和端到端-階段性混合系統在多個資料集上實現了與最先進技術相當甚至更優越的表現。 (zh_TW)
dc.description.abstract: Speaker diarization systems currently fall into three main families: cascaded (multi-stage), end-to-end, and hybrid cascaded/end-to-end systems. End-to-end systems significantly outperform the alternatives on certain datasets and have drawn wide attention; however, they may generalize poorly in real-world applications, which suggests the potential of cascaded systems has been underestimated. Meanwhile, recent speech foundation models perform strongly across many speech tasks, indicating broad applicability, yet their use in speaker diarization remains underexplored. This thesis therefore applies speech foundation models to speaker-diarization subtasks, comparing and benchmarking their performance. It also proposes remedies for known weaknesses of cascaded systems, such as collar-aware speech onset detection and cluster purification, which markedly improve their performance. Finally, out-of-domain evaluation confirms the generalization problem of hybrid cascaded/end-to-end systems, and improvements are proposed. The refined cascaded and hybrid systems achieve performance comparable or superior to the state of the art across multiple datasets. (en)
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-02-22T16:35:59Z. No. of bitstreams: 0 (en)
dc.description.provenance: Made available in DSpace on 2024-02-22T16:35:59Z (GMT). No. of bitstreams: 0 (en)
dc.description.tableofcontents:
Abstract (Chinese) i
Abstract (English) iii
Table of Contents v
List of Figures ix
List of Tables xi
Chapter 1 Introduction 1
1.1 Research Background and Motivation 1
1.2 Research Methods 3
1.3 Main Contributions 5
1.4 Thesis Organization 6
Chapter 2 Background 7
2.1 Speech Foundation Models 7
2.1.1 Overview 7
2.1.2 Self-Supervised Speech Models 7
2.1.3 The Whisper Automatic Speech Recognition System 9
2.1.4 The SUPERB Benchmark 10
2.2 Cascaded Systems 12
2.2.1 Overview 12
2.2.2 Voice Activity Detection (VAD) 13
2.2.3 Overlapped Speech Detection (OSD) 13
2.2.4 Speaker Change Detection (SCD) 14
2.2.5 Speaker Embedding Extraction 15
2.2.6 Speaker Clustering 16
2.3 End-to-End Systems 17
2.3.1 EEND 17
2.3.2 EEND Variants 19
2.4 Hybrid End-to-End/Cascaded Systems 21
2.4.1 EEND-VC 21
2.4.2 Graph-PIT-EEND-VC 24
2.5 Evaluation Metrics 26
2.6 Datasets 27
Chapter 3 Cascaded Systems 29
3.1 Speech Foundation Models on the SUPERB Benchmark 29
3.1.1 Overview 29
3.1.2 Experimental Methods 29
3.1.3 Experimental Setup 30
3.1.4 Analysis of Results 31
3.2 Speech Foundation Models for Voice Activity Detection and Overlapped Speech Detection 33
3.2.1 Overview 33
3.2.2 Experimental Methods 33
3.2.3 Experimental Setup 34
3.2.4 Analysis of Results 35
3.3 Speech Foundation Models for Speaker Change Detection 37
3.3.1 Overview 37
3.3.2 Collar-Aware Speaker Change Detection 38
3.3.3 Experimental Methods 40
3.3.4 Analysis of Results 41
3.4 Speaker Embedding Extraction and Clustering 44
3.4.1 Overview 44
3.4.2 Cluster Purification 44
3.4.3 Evaluation with Ground-Truth Annotations 45
3.4.4 Experimental Methods 47
3.4.5 Clustering Algorithms 49
3.4.6 Analysis of Results 50
3.5 Speech Foundation Models in Cascaded Systems 53
3.5.1 Overview 53
3.5.2 Experimental Methods 53
3.5.3 Experimental Setup 54
3.5.4 Analysis of Results 56
3.6 Chapter Summary 57
Chapter 4 Hybrid End-to-End/Cascaded Systems 59
4.1 Speech Foundation Models for EEND-VC 59
4.1.1 Overview 59
4.1.2 General Model Architecture 59
4.1.3 Experimental Methods and Setup 61
4.1.4 Analysis of Results 64
4.2 Removing the Speaker Embedding Prediction Objective 66
4.2.1 Experimental Methods 66
4.2.2 Analysis of Results 67
4.3 Removing the Linking Constraint 69
4.3.1 Overview 69
4.3.2 Experimental Methods and Setup 69
4.3.3 Analysis of Results 71
4.4 Overall Performance Comparison 73
4.4.1 Experimental Setup 73
4.4.2 Analysis of Results 73
4.4.3 Computational Cost Analysis 74
4.5 Chapter Summary 78
Chapter 5 Conclusion and Future Work 79
5.1 Contributions and Discussion 79
5.2 Future Work 81
References 83
dc.language.iso: zh_TW
dc.title: 基於語音基石模型之語者自動分段標記系統 (zh_TW)
dc.title: Improved Speaker Diarization Based on Speech Foundation Models (en)
dc.type: Thesis
dc.date.schoolyear: 112-1
dc.description.degree: Master's
dc.contributor.oralexamcommittee: 曹昱;蔡宗翰;賴穎暉 (zh_TW)
dc.contributor.oralexamcommittee: Yu Tsao;Tzong-Han Tsai;Ying-Hui Lai (en)
dc.subject.keyword: 語者自動分段標記, 語音基石模型 (zh_TW)
dc.subject.keyword: speaker diarization, speech foundation model (en)
dc.relation.page: 97
dc.identifier.doi: 10.6342/NTU202400179
dc.rights.note: Authorized (campus access only)
dc.date.accepted: 2024-02-02
dc.contributor.author-college: College of Electrical Engineering and Computer Science
dc.contributor.author-dept: Department of Electrical Engineering
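The abstract above attributes part of the cascaded system's gains to "cluster purification" (聚類純化, Section 3.4.2 of the thesis), i.e., cleaning speaker-embedding clusters before the final speaker assignment. The thesis body is not part of this record, so the following is only a minimal, hypothetical sketch of one common reading of that idea: within each cluster, drop the embeddings least aligned with the cluster centroid. The function name, the cosine-similarity criterion, and the keep_ratio parameter are illustrative assumptions, not the thesis's actual algorithm.

```python
import numpy as np

def purify_clusters(embeddings: np.ndarray, labels: np.ndarray,
                    keep_ratio: float = 0.8) -> np.ndarray:
    """Hypothetical cluster-purification sketch: within each cluster, keep
    only the keep_ratio fraction of embeddings most similar (cosine) to the
    cluster centroid, and mark the rest as unassigned (-1)."""
    purified = labels.copy()
    for k in np.unique(labels):
        idx = np.where(labels == k)[0]
        centroid = embeddings[idx].mean(axis=0)
        # Cosine similarity of each member embedding to its cluster centroid.
        sims = embeddings[idx] @ centroid / (
            np.linalg.norm(embeddings[idx], axis=1) * np.linalg.norm(centroid) + 1e-8
        )
        n_drop = len(idx) - max(1, int(len(idx) * keep_ratio))
        if n_drop > 0:
            # Ascending argsort puts the least-similar members first.
            purified[idx[np.argsort(sims)[:n_drop]]] = -1
    return purified

# Toy usage: two 2-D clusters, each with one member least aligned to its centroid.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0],
                [0.0, 1.0], [0.1, 0.9], [0.0, 0.95]])
lab = np.array([0, 0, 0, 1, 1, 1])
print(purify_clusters(emb, lab, keep_ratio=0.7))  # -> [ 0  0 -1  1 -1  1]
```

In a full cascaded pipeline the purified assignments would typically feed centroid re-estimation or re-clustering; a fixed keep ratio is used here purely to keep the sketch self-contained.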
Appears in Collections: Department of Electrical Engineering

Files in This Item:
File: ntu-112-1.pdf (access restricted to NTU campus IPs; use the library VPN service from off campus)
Size: 1.52 MB
Format: Adobe PDF

