Please use this identifier to cite or link to this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/80221

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 李宏毅 (Hung-Yi Lee) | |
| dc.contributor.author | Tsung-Han Wu | en |
| dc.contributor.author | 吳宗翰 | zh_TW |
| dc.date.accessioned | 2022-11-24T03:02:45Z | - |
| dc.date.available | 2021-07-23 | |
| dc.date.available | 2022-11-24T03:02:45Z | - |
| dc.date.copyright | 2021-07-23 | |
| dc.date.issued | 2021 | |
| dc.date.submitted | 2021-07-12 | |
| dc.identifier.citation | [1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," arXiv preprint arXiv:1706.03762, 2017. [2] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," arXiv preprint arXiv:1310.4546, 2013. [3] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013. [4] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018. [5] S. Wang, B. Li, M. Khabsa, H. Fang, and H. Ma, "Linformer: Self-attention with linear complexity," arXiv preprint arXiv:2006.04768, 2020. [6] K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser et al., "Rethinking attention with Performers," arXiv preprint arXiv:2009.14794, 2020. [7] M. Cettolo, J. Niehues, S. Stüker, L. Bentivogli, and M. Federico, "Report on the 11th IWSLT evaluation campaign, IWSLT 2014," in Proceedings of the International Workshop on Spoken Language Translation, Hanoi, Vietnam, vol. 57, 2014. [8] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014. [9] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, "Deep contextualized word representations," arXiv preprint arXiv:1802.05365, 2018. [10] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997. [11] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training," 2018. [12] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are unsupervised multitask learners," OpenAI Blog, vol. 1, no. 8, p. 9, 2019. [13] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," arXiv preprint arXiv:2005.14165, 2020. [14] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, "XLNet: Generalized autoregressive pretraining for language understanding," arXiv preprint arXiv:1906.08237, 2019. [15] Y. Liu, W.-L. Zhao, C.-W. Ngo, C.-S. Xu, and H.-Q. Lu, "Coherent bag-of-audio-words model for efficient large-scale video copy detection," in Proceedings of the ACM International Conference on Image and Video Retrieval. ACM, 2010, pp. 89–96. [16] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, "ALBERT: A lite BERT for self-supervised learning of language representations," arXiv preprint arXiv:1909.11942, 2019. [17] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, "ELECTRA: Pre-training text encoders as discriminators rather than generators," arXiv preprint arXiv:2003.10555, 2020. [18] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension," arXiv preprint arXiv:1910.13461, 2019. [19] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," arXiv preprint arXiv:1910.10683, 2019. [20] I. Beltagy, M. E. Peters, and A. Cohan, "Longformer: The long-document transformer," arXiv preprint arXiv:2004.05150, 2020. [21] M. MacDonell and R. Colwell, "Phylogeny of the Vibrionaceae, and recommendation for two new genera, Listonella and Shewanella," Systematic and Applied Microbiology, vol. 6, no. 2, pp. 171–182, 1985. [22] A. Roy, M. Saffar, A. Vaswani, and D. Grangier, "Efficient content-based sparse attention with routing transformers," Transactions of the Association for Computational Linguistics, vol. 9, pp. 53–68, 2021. [23] Y. Tay, M. Dehghani, D. Bahri, and D. Metzler, "Efficient transformers: A survey," arXiv preprint arXiv:2009.06732, 2020. [24] R. Child, S. Gray, A. Radford, and I. Sutskever, "Generating long sequences with sparse transformers," arXiv preprint arXiv:1904.10509, 2019. [25] A. Shrivastava and P. Li, "Improved asymmetric locality sensitive hashing (ALSH) for maximum inner product search (MIPS)," arXiv preprint arXiv:1410.5410, 2014. [26] Y. Bachrach, Y. Finkelstein, R. Gilad-Bachrach, L. Katzir, N. Koenigstein, N. Nice, and U. Paquet, "Speeding up the Xbox recommender system using a Euclidean transformation for inner-product spaces," in Proceedings of the 8th ACM Conference on Recommender Systems, 2014, pp. 257–264. [27] B. Neyshabur and N. Srebro, "On symmetric and asymmetric LSHs for inner product search," in International Conference on Machine Learning. PMLR, 2015, pp. 1926–1934. [28] Q. Huang, G. Ma, J. Feng, Q. Fang, and A. K. Tung, "Accurate and fast asymmetric locality-sensitive hashing scheme for maximum inner product search," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1561–1570. [29] Y. Tay, D. Bahri, D. Metzler, D.-C. Juan, Z. Zhao, and C. Zheng, "Synthesizer: Rethinking self-attention in transformer models," arXiv preprint arXiv:2005.00743, 2020. [30] H. Peng, N. Pappas, D. Yogatama, R. Schwartz, N. A. Smith, and L. Kong, "Random feature attention," arXiv preprint arXiv:2103.02143, 2021. [31] Y. Tay, M. Dehghani, S. Abnar, Y. Shen, D. Bahri, P. Pham, J. Rao, L. Yang, S. Ruder, and D. Metzler, "Long Range Arena: A benchmark for efficient transformers," arXiv preprint arXiv:2011.04006, 2020. [32] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: A method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318. [33] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, "SQuAD: 100,000+ questions for machine comprehension of text," arXiv preprint arXiv:1606.05250, 2016. [34] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, "GLUE: A multi-task benchmark and analysis platform for natural language understanding," arXiv preprint arXiv:1804.07461, 2018. [35] A. Raganato, Y. Scherrer, and J. Tiedemann, "Fixed encoder self-attention patterns in transformer-based machine translation," arXiv preprint arXiv:2002.10260, 2020. [36] A. T. Liu, S.-W. Yang, P.-H. Chi, P.-C. Hsu, and H.-Y. Lee, "Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders," arXiv preprint arXiv:1910.12638, 2019. [37] N. Kitaev, Ł. Kaiser, and A. Levskaya, "Reformer: The efficient transformer," arXiv preprint arXiv:2001.04451, 2020. [38] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210. [39] P.-H. Chi, P.-H. Chung, T.-H. Wu, C.-C. Hsieh, Y.-H. Chen, S.-W. Li, and H.-Y. Lee, "Audio ALBERT: A lite BERT for self-supervised learning of audio representation," in 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 344–350. [40] Y. You, J. Li, S. Reddi, J. Hseu, S. Kumar, S. Bhojanapalli, X. Song, J. Demmel, K. Keutzer, and C.-J. Hsieh, "Large batch optimization for deep learning: Training BERT in 76 minutes," in International Conference on Learning Representations, 2019. [41] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008. [42] J. Shlens, "A tutorial on principal component analysis," arXiv preprint arXiv:1404.1100, 2014. | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/80221 | - |
| dc.description.abstract | In recent years, Transformers have largely replaced both traditional methods and recurrent neural networks across many fields, becoming the most popular and best-performing neural network architecture. This outstanding performance, however, comes at a substantial computational cost: when training a Transformer, the main bottleneck is typically the self-attention mechanism. This thesis surveys, analyzes, implements, and compares existing algorithms for accelerating self-attention. Some of these algorithms were proposed only theoretically, while others have so far been applied only in recommender systems or computer vision; one focus of this work is how well they perform when applied to speech processing and natural language processing. The thesis further proposes several methods for optimizing self-attention in Transformers, covering both time complexity and memory usage. We approach the problem from different angles: some methods accelerate computation by directly compressing the size of intermediate matrices, and to maximize the speedup we compress these matrices along different dimensions (see the illustrative sketch after the metadata table below); others build on existing lightweight model architectures, directly modifying the architecture and pairing it with a special initialization method. These methods achieve speedups without significantly degrading model performance. | zh_TW |
| dc.description.provenance | Made available in DSpace on 2022-11-24T03:02:45Z (GMT). No. of bitstreams: 1 U0001-1007202122182200.pdf: 5351701 bytes, checksum: 1ff0759fd0f0cdbe7254fdc2aa638dfb (MD5) Previous issue date: 2021 | en |
| dc.description.tableofcontents | Oral Examination Committee Approval Certificate i; Chinese Abstract ii; English Abstract iii; Chapter 1 Introduction 1; 1.1 Research Motivation 1; 1.2 Research Direction 4; 1.3 Main Contributions 4; 1.4 Thesis Organization 5; Chapter 2 Background 6; 2.1 Deep Learning 6; 2.1.1 Fundamentals 6; 2.1.2 Transformers 10; 2.2 Distributed Representation 13; 2.2.1 Word Vectors 14; 2.2.2 Contextualized Representation 16; 2.3 Experimental Tasks 18; 2.3.1 Text Tasks 18; 2.3.2 Speech Tasks 19; 2.4 Chapter Summary 20; Chapter 3 Acceleration and Optimization Methods for Multi-head Self-attention 21; 3.1 Overview 21; 3.2 Self-attention 22; 3.3 Attention Masks 24; 3.3.1 Fixed 24; 3.3.2 Strided 25; 3.4 Hash Functions 25; 3.4.1 Locality-Sensitive Hashing 26; 3.5 Direct Generation 28; 3.5.1 Dense SYNTHESIZER 29; 3.5.2 Random SYNTHESIZER 30; 3.5.3 Random Feature Attention 30; 3.6 Chapter Summary 30; Chapter 4 Model Acceleration by Shrinking Hidden-layer Tensors 32; 4.1 Overview 32; 4.1.1 Reducing the Intermediate Sequence Length 32; 4.1.2 Reducing the Intermediate Tensor Dimension 35; 4.1.3 Reducing Both the Sequence Length and the Hidden Dimension 36; 4.2 Experimental Results and Discussion 37; 4.2.1 Task Description 37; 4.2.2 Speedup from Tensor Reduction 40; 4.2.3 Effect of Tensor Reduction on Model Performance 41; 4.3 Chapter Summary 41; Chapter 5 Synthesizer Acceleration via Initialization Methods 42; 5.1 Overview 42; 5.1.1 The Synthesizer 42; 5.2 Initialization Methods 42; 5.2.1 Classifying Self-attention Weights 43; 5.2.2 Analyzing Self-attention Weights 44; 5.3 Experimental Results and Analysis 45; 5.3.1 Experimental Results 45; 5.3.2 Attention Weight Visualization 49; 5.4 Chapter Summary 50; Chapter 6 Conclusion and Future Work 52; 6.1 Research Contributions and Discussion 52; 6.2 Future Work 53; References 54 | |
| dc.language.iso | zh-TW | |
| dc.subject | 自注意力機制 | zh_TW |
| dc.subject | 轉換器 | zh_TW |
| dc.subject | Transformers | en |
| dc.subject | Self-attention | en |
| dc.title | 轉換器中自注意力機制的優化 | zh_TW |
| dc.title | Optimization of Self-attention in Transformers | en |
| dc.date.schoolyear | 109-2 | |
| dc.description.degree | Master's | |
| dc.contributor.oralexamcommittee | 李琳山 (Lin-Shan Lee), 鄭秋豫 (Chiu-Yu Tseng), 王小川, 陳信宏, 簡仁宗 | |
| dc.subject.keyword | 轉換器, 自注意力機制 | zh_TW |
| dc.subject.keyword | Transformers, Self-attention | en |
| dc.relation.page | 59 | |
| dc.identifier.doi | 10.6342/NTU202101383 | |
| dc.rights.note | Authorized (access limited to the campus network) | |
| dc.date.accepted | 2021-07-14 | |
| dc.contributor.author-college | 電機資訊學院 | zh_TW |
| dc.contributor.author-dept | 電信工程學研究所 | zh_TW |
| Appears in Collections: | 電信工程學研究所 | |
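
To make the compression idea mentioned in the abstract concrete, the following is a minimal PyTorch sketch of self-attention whose keys and values are projected to a shorter sequence length before the attention map is formed, in the spirit of Linformer [5]. It is an illustrative example only, not the exact method proposed in the thesis; the class name `CompressedSelfAttention`, the `compressed_len` parameter, and the single-head formulation are assumptions made for brevity.

```python
# Illustrative sketch only (not the thesis's exact method): Linformer-style [5]
# compression of the sequence dimension of keys and values, so the attention
# map has shape n x k instead of n x n.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CompressedSelfAttention(nn.Module):
    def __init__(self, d_model: int, compressed_len: int, max_len: int):
        super().__init__()
        self.d_model = d_model
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Learned projection that shrinks the sequence dimension from
        # max_len (n) to compressed_len (k).
        self.len_proj = nn.Linear(max_len, compressed_len)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); for simplicity assume seq_len == max_len.
        q = self.q_proj(x)                                                  # (B, n, d)
        k = self.len_proj(self.k_proj(x).transpose(1, 2))                   # (B, d, k)
        v = self.len_proj(self.v_proj(x).transpose(1, 2)).transpose(1, 2)   # (B, k, d)
        scores = q @ k / (self.d_model ** 0.5)                              # (B, n, k)
        attn = F.softmax(scores, dim=-1)
        return attn @ v                                                     # (B, n, d)


# Usage: the attention map costs n * k rather than n * n.
layer = CompressedSelfAttention(d_model=256, compressed_len=64, max_len=512)
out = layer(torch.randn(2, 512, 256))
print(out.shape)  # torch.Size([2, 512, 256])
```

Because the attention map is n × k rather than n × n, the softmax and the weighted sum scale linearly with the sequence length for a fixed k, which is the kind of speedup the abstract attributes to compressing one dimension of the intermediate matrices; the abstract also notes that compressing other dimensions, or modifying Synthesizer-style architectures with special initialization, are alternative routes explored in the thesis.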
Files in This Item:
| File | Size | Format |
|---|---|---|
| U0001-1007202122182200.pdf (access limited to the NTU IP range) | 5.23 MB | Adobe PDF |
