Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/86552
Full metadata record (DC field: value, language):
dc.contributor.advisor: 李宏毅 (Hung-Yi Lee)
dc.contributor.author: Chan-Jan Hsu (en)
dc.contributor.author: 許湛然 (zh_TW)
dc.date.accessioned: 2023-03-20T00:02:44Z
dc.date.copyright: 2022-08-18
dc.date.issued: 2022
dc.date.submitted: 2022-08-11
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/86552
dc.description.abstract (zh_TW): Advances in neural network architectures and in graphics processing units have enabled computers to perform many human cognitive tasks, in some cases even surpassing human performance. Some of these tasks concern the basic ways humans interact with the world, such as speech recognition and image-text matching; others require models with deeper understanding, such as natural language understanding, visual question answering, and spoken language understanding. Although the tasks are diverse, neural-network approaches to them share a common trait: almost all ultimately rely on representation learning. Representations are usually implemented as vectors: any piece of data can be compressed into a vector of sufficiently large dimension (e.g., 512) that captures the essential information of the original data. Once training is complete, these vectors can be passed to different models to carry out the final task: a classifier for classification, or a decoder that generates sequences for translation, recognition, and so on. Representation learning is therefore a popular and practical research topic, and efficiently compressing information into vectors is its central concern. Much prior work exists for text, images, and audio, with various techniques developed, most of which exploit self-supervised training and can thus learn from unlabeled data. When data contains multiple modalities (two or more of text, images, and audio), multimodal learning is usually needed to extract information from the different modalities, which requires joint training on paired data. Moreover, given the strong performance of self-supervised models, multimodal architectures are typically built by combining existing self-supervised models and their parameters. This makes it possible to solve tasks that require understanding the interaction between modalities, such as visual question answering and spoken language understanding. Because separate unimodal systems do not share a common embedding space, moving from unimodal to multimodal representation learning poses several challenges. In this thesis, I study three approaches to strengthening self-supervised models with multimodal data. The first is visually enhanced language understanding, which uses visual information to improve natural language understanding. The second studies how text and speech from different domains affect the performance of unsupervised speech recognition. The third is spoken language understanding, investigating how the input granularity of a self-supervised text model affects the final spoken language understanding performance.
dc.description.abstract: The advancement of neural networks and GPUs has enabled machines to accomplish cognitive tasks, some of which outperform human baselines. Some tasks focus on a general understanding of how humans interact with their environment, such as speech recognition, image-text matching, and optical character recognition. More advanced tasks dive into the semantics of these signals, such as natural language understanding, visual-text question answering, and spoken language understanding. Despite the variety of tasks, the common approach to solving them is representation learning. Each representation typically takes the form of an embedding, a vector of sufficiently large dimension (for example, 512) that contains the essential information of the signal. Once trained, the vectorized output can be pipelined to a classifier to solve classification problems, or to a sequential decoder for sequence generation. Representation learning is therefore of high interest to researchers. Signals of all types, such as text, speech, and images, have their own pretraining strategies developed to extract information efficiently. Multimodal learning is a type of representation learning that requires knowledge from multiple modalities. Transferring from single-modality to multi-modality representation learning poses challenges, because separate single-modality systems do not share the same embedding space. Therefore, learning multimodal relations often involves specific cross-modal structures combined with multimodal pretraining. Applications of multimodal learning include spoken language understanding, image question answering, etc. In my thesis, I investigate three directions related to enhancing multimodal learning. The first is visually enhanced language learning, where visual information is used to improve natural language understanding. The second concerns the robustness of unsupervised ASR, in which I experimented with speech and text from different domains to determine how domain mismatch affects performance. The third addresses the spoken language understanding task, where I explored how the granularity of text model inputs affects the final spoken language understanding results. (en)
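As a minimal illustration of the pipeline the abstract describes (signal → fixed-size embedding → classifier or sequence decoder), the following sketch uses a toy PyTorch encoder. All module names, dimensions, and data below are hypothetical placeholders for illustration and are not taken from the thesis.

    # Hypothetical sketch of "embedding -> classifier or decoder"; not the thesis code.
    import torch
    import torch.nn as nn

    EMB_DIM = 512        # the "sufficiently large dimension" mentioned in the abstract
    NUM_CLASSES = 10     # e.g., a classification task
    VOCAB_SIZE = 1000    # e.g., an output vocabulary for sequence generation

    class ToyEncoder(nn.Module):
        """Stands in for a pretrained self-supervised encoder (text, speech, or image)."""
        def __init__(self, input_dim=80):
            super().__init__()
            self.proj = nn.Linear(input_dim, EMB_DIM)

        def forward(self, x):                # x: (batch, time, input_dim)
            h = torch.relu(self.proj(x))     # frame-level representations
            return h.mean(dim=1)             # pool into one embedding per example

    encoder = ToyEncoder()
    classifier = nn.Linear(EMB_DIM, NUM_CLASSES)           # classification head
    decoder = nn.GRU(EMB_DIM, EMB_DIM, batch_first=True)   # toy sequence decoder
    out_proj = nn.Linear(EMB_DIM, VOCAB_SIZE)

    signal = torch.randn(2, 100, 80)                  # two fake input sequences
    embedding = encoder(signal)                       # (2, 512) shared representation
    class_logits = classifier(embedding)              # (2, NUM_CLASSES)
    dec_in = embedding.unsqueeze(1).repeat(1, 5, 1)   # (2, 5, 512) repeated as decoder input
    dec_out, _ = decoder(dec_in)
    token_logits = out_proj(dec_out)                  # (2, 5, VOCAB_SIZE)
    print(class_logits.shape, token_logits.shape)

In practice the toy encoder would be replaced by a pretrained self-supervised model, and only the task-specific head (classifier or decoder) would differ between tasks.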
dc.description.provenance: Made available in DSpace on 2023-03-20T00:02:44Z (GMT). No. of bitstreams: 1; U0001-0908202220272100.pdf: 1792078 bytes, checksum: 6c31ec0ae1ce4b836f919db77351ecc2 (MD5); Previous issue date: 2022 (en)
dc.description.tableofcontents:
Acknowledgements
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Motivation
  1.2 Research Directions
  1.3 Contributions
  1.4 Thesis Organization
Chapter 2 Background
  2.1 Unimodal Deep Neural Networks
    2.1.1 Fully Connected Neural Networks
    2.1.2 Convolutional Neural Networks
    2.1.3 Transformers and Their Variants
  2.2 Multimodal Deep Neural Networks and Pretraining Tasks
    2.2.1 Image-Text Systems
    2.2.2 Text-Speech Systems
Chapter 3 Enhancing Text Understanding Systems with Visual Information
  3.1 Introduction
    3.1.1 Task Description
    3.1.2 Development of Natural Language Understanding
    3.1.3 Development of Multimodal (Text and Image) Understanding
  3.2 Method
    3.2.1 Model Selection and Considerations
    3.2.2 Model Architecture Details
    3.2.3 Multimodal Training Objectives
  3.3 Experimental Results
    3.3.1 Objective Evaluation
    3.3.2 Analysis of Results
    3.3.3 Ablation Studies
    3.3.4 Summary and Outlook
Chapter 4 A Study on the Robustness of Speech-Text Unsupervised Speech Recognition Systems
  4.1 Introduction
    4.1.1 Task Description
    4.1.2 Development of Unsupervised Speech Recognition
  4.2 Method
    4.2.1 Datasets
      4.2.1.1 Speech dataset: LibriSpeech
      4.2.1.2 Speech dataset: TED talks collection (TED-LIUM v3)
      4.2.1.3 Speech dataset: SwitchBoard
      4.2.1.4 Text dataset: LibriLM
      4.2.1.5 Text dataset: Wikipedia (Wiki103)
      4.2.1.6 Text dataset: NewsCrawl
      4.2.1.7 Text dataset: image captions (ImageCorpus)
    4.2.2 Experimental Procedure
    4.2.3 Quantifying Text-Speech Similarity
  4.3 Experimental Results
    4.3.1 Text-Speech Dissimilarity
    4.3.2 Unsupervised Speech Recognition Results
    4.3.3 Relation Between Unsupervised ASR Results and Text-Speech Dissimilarity
  4.4 Summary
Chapter 5 Enhancing Speech-Text Semantic Understanding Systems by Improving Speech-Text Alignment
  5.1 Introduction
    5.1.1 Task Description
    5.1.2 Development of Spoken Language Understanding
  5.2 Method
    5.2.1 Datasets
    5.2.2 Model Selection and Considerations
    5.2.3 T5lephone: a Model with Second-Stage Pretraining on Phonemes
    5.2.4 Validation on Cascaded Spoken Question Answering
    5.2.5 Fine-Tuning on End-to-End Speech Translation
  5.3 Experimental Results
    5.3.1 Validation on Cascaded Spoken Question Answering
    5.3.2 Fine-Tuning on End-to-End Spoken Question Answering
    5.3.3 Fine-Tuning on End-to-End Speech Translation
  5.4 Summary
Chapter 6 Conclusions and Future Work
  6.0.1 Research Summary
  6.0.2 Future Directions
References
dc.language.iso: zh-TW
dc.subject: 深度學習 (Deep Learning) (zh_TW)
dc.subject: 多模態學習 (Multimodal Learning) (zh_TW)
dc.subject: 語音處理 (Speech Processing) (zh_TW)
dc.subject: Deep Learning (en)
dc.subject: Multimodal Learning (en)
dc.subject: Speech Processing (en)
dc.title: 以多模態資訊強化自督導式學習 (zh_TW)
dc.title: Enhanced Self-Supervised Learning by Multimodal Information (en)
dc.type: Thesis
dc.date.schoolyear: 110-2
dc.description.degree: 碩士 (Master's)
dc.contributor.coadvisor: 曹昱 (Yu Tsao)
dc.contributor.oralexamcommittee: 李琳山 (Lin-Shan Lee), 蔡宗翰 (Tsung-Han Tsai)
dc.subject.keyword: 多模態學習 (Multimodal Learning), 語音處理 (Speech Processing), 深度學習 (Deep Learning) (zh_TW)
dc.subject.keyword: Speech Processing, Multimodal Learning, Deep Learning (en)
dc.relation.page: 78
dc.identifier.doi: 10.6342/NTU202202225
dc.rights.note: 同意授權(全球公開) (consent granted; worldwide open access)
dc.date.accepted: 2022-08-12
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) (zh_TW)
dc.contributor.author-dept: 資料科學學位學程 (Data Science Degree Program) (zh_TW)
dc.date.embargo-lift: 2022-08-18
Appears in Collections: 資料科學學位學程 (Data Science Degree Program)

Files in This Item:
File | Size | Format
U0001-0908202220272100.pdf | 1.75 MB | Adobe PDF


Items in this repository are protected by copyright, with all rights reserved, unless their copyright terms are specifically stated otherwise.
