Towards Simultaneous Speech Translation for Long Utterances at Low Latency

Chih-Chiang Chang (張致強)

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/86959

Title:	Towards Simultaneous Speech Translation for Long Utterances at Low Latency (邁向低延遲與長語句之同步自動語音翻譯)
Author:	Chih-Chiang Chang (張致強)
Advisor:	Lin-shan Lee (李琳山)
Keywords:	Speech translation, Simultaneous translation, Reordering, Continuous integrate-and-fire, End-to-end
Publication year:	2022
Degree:	Master's
Abstract:	Simultaneous speech translation (SimulST) is the task of translating streaming speech by machine. The system must translate while the source speaker is still speaking, so it must be able to translate incomplete input and to decide on a read-write policy. SimulST seeks a better trade-off between latency and translation quality; in particular, good quality must be maintained at low latency. In addition, because real-world speech input is continuous, it is also important that the model generalizes to long utterances. Among SimulST approaches, simultaneous machine translation (SimulMT) assumes the existence of a streaming speech recognition system and therefore takes text as input, whereas end-to-end SimulST takes speech signals directly as input. SimulMT models are commonly trained on ordinary translation datasets. However, because of reordering between languages, this can lead to incorrect learning objectives or unnecessarily high latency. Existing approaches often rewrite the reference translations to be monotonic.
In contrast, this thesis divides translation into a monotonic translation model and a reordering model, and keeps only the monotonic translation model at inference time to achieve SimulMT. Experiments show that the proposed approach improves English-to-Chinese translation quality at low latency. For end-to-end SimulST, experiments show that existing approaches based on monotonic multihead attention (MMA) fail to generalize to long speech utterances. This thesis therefore adapts the continuous integrate-and-fire (CIF) mechanism to SimulST; experiments confirm that the CIF-based approach generalizes better to long utterances and also outperforms the MMA-based approaches at low latency. In summary, the methods proposed in this thesis improve translation quality at low latency or on long utterances, and are expected to bring SimulST closer to practical applications.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/86959
DOI:	10.6342/NTU202300075
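Full-text authorization:	Authorized (open access worldwide)
Appears in collections:	Department of Computer Science and Information Engineering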
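
As a rough illustration of the continuous integrate-and-fire (CIF) idea named in the abstract, the sketch below accumulates per-frame weights over encoder states and emits ("fires") one integrated vector each time the accumulated weight reaches a threshold. It is a minimal sketch and not the thesis's implementation: the function name `cif`, the default threshold of 1.0, and passing the weights in directly (in a real model they would be predicted by a network) are all assumptions made for illustration, as is the requirement that each weight lies in [0, 1].

```python
from typing import List, Sequence


def cif(
    states: Sequence[Sequence[float]],
    weights: Sequence[float],
    threshold: float = 1.0,
) -> List[List[float]]:
    """Illustrative CIF: integrate weighted states, fire at the threshold.

    Assumes every weight is in [0, threshold], so a single input frame
    can trigger at most one firing. Names and defaults are hypothetical.
    """
    dim = len(states[0]) if states else 0
    fired: List[List[float]] = []
    acc_weight = 0.0              # weight accumulated since the last firing
    acc_state = [0.0] * dim       # weighted sum of states since the last firing

    for h, a in zip(states, weights):
        if acc_weight + a < threshold:
            # Not enough weight yet: keep integrating.
            acc_weight += a
            acc_state = [s + a * x for s, x in zip(acc_state, h)]
        else:
            # Use just enough of this frame's weight to reach the threshold,
            # fire the integrated vector, and carry the rest forward.
            used = threshold - acc_weight
            fired.append([s + used * x for s, x in zip(acc_state, h)])
            rest = a - used
            acc_weight = rest
            acc_state = [rest * x for x in h]

    # Any leftover weight below the threshold is dropped in this sketch.
    return fired


if __name__ == "__main__":
    frames = [[1.0], [2.0], [3.0], [4.0]]
    alphas = [0.5, 0.5, 0.5, 0.5]
    print(cif(frames, alphas))
```

Because each vector is emitted as soon as enough weight has accumulated, a decoder built on top of such a mechanism does not need to see the whole utterance before producing output, which is the property that makes it a candidate for streaming input.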

Files in this item:

File	Size	Format
ntu-111-1.pdf	12.28 MB	Adobe PDF

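Unless their copyright terms are specifically indicated otherwise, items in this system are protected by copyright, with all rights reserved.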