  1. NTU Theses and Dissertations Repository
  2. College of Liberal Arts
  3. Department of Library and Information Science
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101558
Title: 台語火星文轉譯教育部台語規範用字之研究
A Study on the Translation of Taiwanese (Taigi) Internet Slang into the Standard Taiwanese Characters Prescribed by the Ministry of Education
Author: 簡翊淇
Ik-Ki Kan
Advisor: 陳光華
Kuang-hua Chen
Keywords: Taiwanese (Taigi), Taiwanese Internet Slang, Text Normalization, Edit Distance, Standard Taiwanese Characters
Year of publication: 2026
Degree: Master's
Abstract:
This study addresses the difficulty of identifying and circulating Taiwanese text online when it is written in inconsistent forms (commonly referred to as Taiwanese Internet Slang). As online communities have spread, users frequently write Taiwanese with non-standard strategies such as Mandarin homophonic borrowing, Bopomofo symbols, or mixed phonetic spellings, which makes the resulting text highly divergent. Taking the Dictionary of Frequently-Used Taiwanese Taigi prescribed by the Ministry of Education as the standard, the study builds an automated translation mechanism intended to improve the consistency and readability of digital Taiwanese text.
A mixed-methods research approach was adopted. A corpus was collected from platforms including PTT, Facebook, and Threads, yielding 28,886 valid entries, which were randomly split into training and test sets at an 8:2 ratio. For the implementation, a "two-layer translation mechanism" was constructed. The first layer is a rule-based mapping module that applies a "many-to-one mapping" principle. The second layer is an edit-distance fuzzy matching module that uses "Bopomofo-mediated transcoding" to carry out "triple-path matching": comparison of pure Han character glyphs, comparison of the phonetic transcoding against standard characters, and comparison of the phonetic transcoding against known variants.
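The two-layer idea described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the mapping entries, the tiny character-to-Bopomofo table, the 0.8 similarity threshold, and all function names are hypothetical, and the abstract does not say how scores from the three paths are normalized or combined, so picking the best-scoring candidate is also an assumption.

```python
from typing import Dict, List, Optional, Tuple

# Layer 1 (assumed): many non-standard spellings map to one standard form.
# Illustrative entries only, not taken from the thesis.
VARIANT_TO_STANDARD: Dict[str, str] = {
    "母湯": "毋通",
    "姆湯": "毋通",
}

# Toy character -> Mandarin Bopomofo table; a real system needs full coverage.
BOPOMOFO: Dict[str, str] = {
    "母": "ㄇㄨˇ", "姆": "ㄇㄨˇ", "湯": "ㄊㄤ", "燙": "ㄊㄤˋ",
    "毋": "ㄨˊ", "通": "ㄊㄨㄥ",
}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, keeping only one previous row of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def to_bopomofo(text: str) -> str:
    """Transcode character by character; unmapped characters pass through."""
    return "".join(BOPOMOFO.get(ch, ch) for ch in text)


def normalize(token: str, threshold: float = 0.8) -> Optional[str]:
    # Layer 1: exact rule-based lookup.
    if token in VARIANT_TO_STANDARD:
        return VARIANT_TO_STANDARD[token]

    # Layer 2: fuzzy matching along three paths.
    phonetic = to_bopomofo(token)
    candidates: List[Tuple[float, str]] = []
    for standard in sorted(set(VARIANT_TO_STANDARD.values())):
        # Path 1: glyph comparison against the standard form.
        candidates.append((similarity(token, standard), standard))
        # Path 2: phonetic comparison against the standard form.
        candidates.append((similarity(phonetic, to_bopomofo(standard)), standard))
    for variant, standard in VARIANT_TO_STANDARD.items():
        # Path 3: phonetic comparison against known variants.
        candidates.append((similarity(phonetic, to_bopomofo(variant)), standard))

    if not candidates:
        return None
    best_score, best_standard = max(candidates)
    return best_standard if best_score >= threshold else None
```

With these toy tables, `normalize("母湯")` is resolved by the first layer, while the unseen spelling `母燙` falls through to the second layer and is matched phonetically against the known variant 母湯. Unrelated input such as `你好` scores below the threshold on every path and is left unconverted (`None`).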
On a test set of 5,778 entries, the system achieved an accuracy of 99.83% and a precision of 99.95%. Quantitative analysis confirms that the first-layer rule-based module resolved 98.96% of high-frequency conventional terms, while the second-layer fuzzy matching module effectively recalled variants not included in the dictionary. Qualitative analysis revealed that 91.04% of the slang belongs to the "Mandarin homophonic borrowing" type, reflecting a practice of "visualized Taiwanese" (a phenomenon consistent with eye dialect) that arises from input method constraints. Experimental data further show that phonetic features contribute far more to translation than glyph features do.
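As a rough sanity check on how the reported figures relate to one another, the sketch below gives the standard definitions of accuracy and precision and works through two pieces of arithmetic. It assumes that accuracy is computed per entry over the 5,778 test entries and that the 20% test share was rounded up; neither is stated in the abstract. The function names and the toy confusion counts are hypothetical and are not results from the thesis.

```python
import math
from typing import List

TOTAL_ENTRIES = 28_886  # valid entries, from the abstract
TEST_ENTRIES = 5_778    # test set size, from the abstract


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all decisions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp: int, fp: int) -> float:
    """Share of positive outputs that were correct."""
    return tp / (tp + fp)


def error_counts_matching(total: int, reported_pct: float) -> List[int]:
    """Error counts whose per-entry accuracy rounds to reported_pct (2 decimals)."""
    return [
        errors
        for errors in range(total + 1)
        if abs(100 * (total - errors) / total - reported_pct) < 0.005
    ]


# 20% of 28,886 is 5,777.2, so a test set of 5,778 is consistent with rounding up.
test_size_if_rounded_up = math.ceil(0.2 * TOTAL_ENTRIES)

# Under the per-entry assumption, which error counts would round to 99.83%?
plausible_errors = error_counts_matching(TEST_ENTRIES, 99.83)

# Toy confusion counts (hypothetical) to show how the two metrics can differ.
toy_accuracy = accuracy(tp=90, tn=5, fp=2, fn=3)
toy_precision = precision(tp=90, fp=2)
```

Under these assumptions, a reported accuracy of 99.83% on 5,778 entries corresponds to about 10 misclassified entries. The precision figure cannot be back-solved the same way without knowing how many positive outputs the system produced.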
The study concludes that a hybrid model combining rule-based mapping with statistical similarity provides an interpretable "white-box" text normalization solution for low-resource languages such as Taiwanese. The results can serve as front-end preprocessing for Large Language Models (LLMs), reducing noise and improving model performance. Future work should establish systematic corpus collection and automated identification mechanisms to promote the sustainable development of local languages in the digital era.
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101558
DOI: 10.6342/NTU202504781
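Full-text license: Authorized (open access worldwide)
Full-text release date: 2026-02-12
Appears in collections: Department of Library and Information Science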

Files in this item:
File            Size      Format
ntu-114-1.pdf   2.24 MB   Adobe PDF


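Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.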
