Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97096

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 李宏毅 | zh_TW |
| dc.contributor.advisor | Hung-yi Lee | en |
| dc.contributor.author | 張凱爲 | zh_TW |
| dc.contributor.author | Kai-Wei Chang | en |
| dc.date.accessioned | 2025-02-27T16:10:30Z | - |
| dc.date.available | 2025-02-28 | - |
| dc.date.copyright | 2025-02-27 | - |
| dc.date.issued | 2025 | - |
| dc.date.submitted | 2025-02-08 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97096 | - |
| dc.description.abstract | 「預訓練-微調」(Pre-train, fine-tune)範式長期以來一直是語音處理領域的主流方法,其透過設計專家模型並微調預訓練語音表徵模型以應對各種下游語音處理任務。然而,隨著需要處理的下游任務數量的增加,此方法在擴展性上面臨重大挑戰,往往需要大量的人力、儲存資源以及計算成本。為了解決這些挑戰,本論文提出了一個高效且通用的語音處理框架,能以統一的方式處理多樣化的任務。 受「提示範式」(Prompting paradigm)在自然語言處理(NLP)領域的成功,以及近期以離散語音單元(Discrete speech unit)為基礎的語音語言模型(Speech LMs)的啟發,本論文率先將提示方法應用於語音處理領域。為此,我們提出了名為 SpeechPrompt 的統一提示框架,專為語音語言模型設計,能夠處理廣泛的任務。提示方法於系統輸入端插入提示(Prompt)以指引語言模型,並利用語言模型的生成能力來解決多樣化的下游任務,此項設計使提示方法無需進行大量的模型重新設計,進而顯著降低計算需求與人力負擔。透過將語音處理任務重新定義為「語音到單元生成」(Speech-to-unit generation)任務,本研究展示了如何在 SpeechPrompt 框架內整合語音分類(Speech classification)、序列生成(Sequence generation)以及語音生成(Speech generation)任務,為語音處理領域提供了一個可擴展、高效且統一的解決方案。實驗結果顯示,比較提示範式與基於自監督學習(Self-supervised learning)模型的微調方法(Fine-tuning),在具有相似的可訓練參數(Trainable parameters)數量下,能夠達到具有競爭力的表現。此外,在少樣本學習(Few-shot learning)場景中,提示範式展現出相較微調方法更卓越的表現。 本論文亦探討了語言模型的兩大主要架構——編碼器-解碼器(Encoder-decoder)語言模型與解碼器(Decoder-only)語言模型——在提示框架中的應用。我們發現,在提示框架中,編碼器-解碼器語音語言模型的性能優於解碼器架構,這與自然語言處理領域的主流做法有所不同,後者主要聚焦於開發僅解碼器語言模型,此研究可作為後續開發語音語言模型的借鏡。 此外,本論文還研究了另一項重要的提示技術——上下文學習(In-Context Learning,ICL)。此技術可將範例資料作為輸入,讓語音語言模型無需任何額外的訓練即可學習範例資料的模式以執行新任務。實驗結果顯示,語音語言模型的上下文學習能夠展現與簡單監督式學習相當的性能。研究結果驗證了將上下文學習應用於語音處理的可行性,此研究為未來開發高效的語音模型奠定了基礎。 綜上所述,本論文首次系統性地探討了提示框架在語音語言模型中處理多樣化語音任務的應用。此研究深入探索了提示範式於語音處理領域的可行性,為未來語音語言模型的開發提供了重要的參考。所提出的提示框架展現了極大的研究潛力與應用價值,為語音處理技術的進一步發展奠定了基礎。 | zh_TW |
| dc.description.abstract | The "pre-train, fine-tune" paradigm has long been the dominant approach in the field of speech processing. It involves designing expert models and fine-tuning pre-trained speech representation models to serve various downstream speech processing tasks. While effective, this approach faces significant challenges when the number of downstream tasks to be served increases, as it demands considerable human effort, storage capacity, and computational resources. To address these limitations, this thesis proposes the development of an efficient and universal speech processing framework capable of addressing diverse tasks in a unified manner. Inspired by the success of the "prompting paradigm" in Natural Language Processing (NLP) and the recent advancements in Speech Language Models (Speech LMs) trained on quantized discrete speech units, this thesis pioneers the application of prompting to speech processing. To this end, we introduce SpeechPrompt, a unified prompting framework designed for Speech LMs to handle a broad spectrum of tasks. Prompting focuses solely on modifying the input, leveraging the generative capabilities of LMs to tackle diverse downstream tasks without extensive model redesign. This significantly reduces computational demands and human effort. By reformulating speech processing tasks into speech-to-unit generation, this research demonstrates the seamless integration of speech classification, sequence generation, and speech generation tasks within the SpeechPrompt framework, offering a scalable, efficient, and unified solution to the speech processing field. Experimental results show that, with a similar number of trainable parameters, the prompting method achieves competitive performance compared to fine-tuning approaches that are based on self-supervised learning models. Additionally, prompting demonstrates strong performance in few-shot learning scenarios. This thesis also investigates two major architectures of language models within the prompting paradigm: "Encoder-decoder LMs" and "Decoder-only LMs". Our findings reveal that encoder-decoder speech LMs significantly outperform decoder-only models within the prompting framework, contrasting with the mainstream practice in NLP, which predominantly focuses on developing decoder-only LMs. Furthermore, this thesis explores another key prompting technique: In-Context Learning (ICL). This technique enables speech LMs to perform new tasks without additional training by learning patterns from input examples. Experimental results indicate that in-context learning achieves competitive performance compared to simple supervised baselines. The findings not only validate the feasibility of applying ICL to speech processing but also highlight its potential to reduce computational costs and improve adaptability across diverse tasks. These contributions lay the groundwork for future advancements in scalable and efficient speech model development. In summary, this thesis systematically explores the prompting frameworks in handling diverse speech tasks with speech language models. The research delves into the feasibility of applying the prompting paradigm to speech processing and provides valuable insights for the future development of speech language models. The proposed prompting framework demonstrates significant research potential and practical value, laying a solid foundation for further advancements in speech processing technology. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-02-27T16:10:30Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2025-02-27T16:10:30Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | 致謝 i 摘要 iii Abstract v Contents ix List of Figures xiii List of Tables xix Chapter 1 Introduction 1 1.1 Motivation 1 1.2 Contribution 5 1.3 Overview 7 Chapter 2 Background 9 2.1 Speech Representation 12 2.1.1 Self-supervised Speech Representation Learning 12 2.1.2 Speech Representation for Speech Language Models 19 2.2 Speech Language Models 26 2.2.1 Textless Speech Language Models 33 2.2.2 Speech-Aware Text Language Models 35 2.2.3 Text-Aware Speech Language Models 38 2.2.4 Speech-Text Language Models 41 2.3 Pre-train, Fine-tune Paradigm 44 2.3.1 Model Tuning 47 2.3.2 Prompt Tuning (on Representation Model) 48 2.3.3 Adapter Tuning 51 2.4 Prompting Paradigm 54 2.4.1 Discrete Prompt Learning 57 2.4.2 Prompt Tuning (on Language Model) 59 2.4.3 In-Context Learning 64 2.5 Summary 68 Chapter 3 SpeechPrompt: Prompting Speech Language Models for Speech Processing 71 3.1 Introduction 72 3.2 Method 76 3.2.1 Unit Language Models 78 3.2.2 Prompt Tuning 81 3.2.3 Speech-to-Unit Generation 83 3.2.4 Verbalizer and Speech Decoder 83 3.2.5 Learnable Verbalizer 85 3.3 Experimental Setup 87 3.3.1 Tasks and Datasets 87 3.3.2 Model and Training Setup 91 3.4 Results 94 3.4.1 Main Results 94 3.4.2 Few-shot Learning 100 3.4.3 Verbalizer Analysis 102 3.5 Discussion 104 3.6 Summary 106 Chapter 4 Analysis of Architecture and Pre-training Tasks for Speech Language Models 109 4.1 Introduction 110 4.2 Method 113 4.2.1 Encoder-Decoder Speech Language Model 113 4.2.2 Prompting 113 4.2.3 Adapter Tuning 114 4.3 Experimental Settings 115 4.3.1 Wav2Seq 116 4.3.2 Prompting Paradigm 117 4.3.3 Adapter Tuning 118 4.4 Results 119 4.4.1 Prompting Encoder-Decoder Speech LM vs. Prompting Decoder-only Speech LM 119 4.4.2 Prompting for Cross-lingual Transfer Learning 121 4.4.3 Comparison of Prompting and Adapter Tuning 123 4.5 Summary 124 Chapter 5 In-context Learning for Speech Language Models 127 5.1 Introduction 128 5.2 Method 132 5.2.1 Warmup Training 133 5.2.2 In-context Learning 135 5.3 Experimental Setup 135 5.3.1 Tasks and Datasets 135 5.3.2 Implementation Detail 136 5.4 Results 137 5.4.1 Main Result 137 5.4.2 Model Behavior Analysis 138 5.4.3 Utterance Length Analysis 140 5.5 Summary 141 Chapter 6 Conclusion 143 6.1 Summary of Contributions 143 6.2 Future Directions 144 6.3 Closing Remarks 149 References 151 | - |
| dc.language.iso | en | - |
| dc.subject | 自監督式學習 | zh_TW |
| dc.subject | 參數高效率學習 | zh_TW |
| dc.subject | 提示學習 | zh_TW |
| dc.subject | 大型語言模型 | zh_TW |
| dc.subject | 語音語言模型 | zh_TW |
| dc.subject | Parameter-efficient Learning | en |
| dc.subject | Speech Language Model | en |
| dc.subject | Large Language Model | en |
| dc.subject | Prompting | en |
| dc.subject | Self-supervised Learning | en |
| dc.title | 邁向通用語音模型:提示語音語言模型於多樣語音處理任務 | zh_TW |
| dc.title | Towards a Universal Speech Model: Prompting Speech Language Models for Diverse Speech Processing Tasks | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 113-1 | - |
| dc.description.degree | 博士 | - |
| dc.contributor.oralexamcommittee | 李琳山;賴穎暉;林軒田;孫紹華;陳尚澤;王新民 | zh_TW |
| dc.contributor.oralexamcommittee | Lin-shan Lee;Ying-Hui Lai;Hsuan-Tien Lin;Shao-Hua Sun;Shang-Tse Chen;Hsin-Min Wang | en |
| dc.subject.keyword | 語音語言模型,大型語言模型,提示學習,自監督式學習,參數高效率學習, | zh_TW |
| dc.subject.keyword | Speech Language Model,Large Language Model,Prompting,Self-supervised Learning,Parameter-efficient Learning, | en |
| dc.relation.page | 181 | - |
| dc.identifier.doi | 10.6342/NTU202500479 | - |
| dc.rights.note | 同意授權(全球公開) | - |
| dc.date.accepted | 2025-02-10 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 電信工程學研究所 | - |
| dc.date.embargo-lift | 2025-02-28 | - |
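The English abstract above describes prompting as tuning only input-side parameters of a frozen speech language model over discrete units, with a verbalizer mapping generated units back to task labels. The PyTorch sketch below illustrates that general idea only; the toy model, dimensions, verbalizer mapping, and data are hypothetical stand-ins and are not taken from the thesis or the SpeechPrompt implementation.

```python
# Minimal illustrative sketch of input-side prompt tuning on a frozen "unit LM".
# Everything here (model, sizes, verbalizer, data) is a hypothetical stand-in.
import torch
import torch.nn as nn

VOCAB, DIM, PROMPT_LEN = 100, 64, 8  # toy sizes (assumptions)

class ToyUnitLM(nn.Module):
    """Stand-in for a pre-trained language model over discrete speech units."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(DIM, VOCAB)

    def forward(self, unit_ids, prompt=None):
        x = self.embed(unit_ids)                          # (B, T, DIM)
        if prompt is not None:                            # prepend learnable prompt vectors
            x = torch.cat([prompt.expand(x.size(0), -1, -1), x], dim=1)
        return self.lm_head(self.encoder(x))              # (B, P+T, VOCAB)

lm = ToyUnitLM().eval()
for p in lm.parameters():                                 # the pre-trained LM stays frozen
    p.requires_grad_(False)

prompt = nn.Parameter(0.02 * torch.randn(PROMPT_LEN, DIM))  # the only trained parameters
verbalizer = {0: 7, 1: 42}                                # class label -> (hypothetical) unit id
optimizer = torch.optim.Adam([prompt], lr=1e-3)

units = torch.randint(0, VOCAB, (4, 20))                  # fabricated discrete-unit inputs
labels = torch.tensor([0, 1, 0, 1])                       # fabricated task labels
targets = torch.tensor([verbalizer[int(y)] for y in labels])

for step in range(10):
    logits = lm(units, prompt=prompt)                     # gradients reach only the prompt
    loss = nn.functional.cross_entropy(logits[:, -1, :], targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(f"final training loss: {loss.item():.3f}")
```

Per the abstract, the full framework also covers sequence and speech generation by treating every task as speech-to-unit generation; this sketch shows only the classification-through-verbalizer case for brevity.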
Appears in Collections: 電信工程學研究所
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-113-1.pdf | 8.08 MB | Adobe PDF |
