Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101558
Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 陳光華 | zh_TW |
| dc.contributor.advisor | Kuang-hua Chen | en |
| dc.contributor.author | 簡翊淇 | zh_TW |
| dc.contributor.author | Ik-Ki Kan | en |
| dc.date.accessioned | 2026-02-11T16:21:00Z | - |
| dc.date.available | 2026-02-12 | - |
| dc.date.copyright | 2026-02-11 | - |
| dc.date.issued | 2026 | - |
| dc.date.submitted | 2026-01-29 | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101558 | - |
| dc.description.abstract | 本研究旨在解決數位網路環境中,台語文本因書寫形式混亂(台語火星文)而導致資訊難以辨識與流通的問題。隨著網路社群普及,使用者常採用華語借音、注音符號或拼音混用等非規範方式書寫台語,造成文本高度歧異。本研究以教育部《臺灣台語常用詞辭典》為標準,建立一套自動化轉譯機制,期能提升台語數位文本的一致性與可讀性。
本研究採用混合研究法,從批踢踢實業坊、Facebook、Threads等平台擷取語料,共獲得有效語料28,886筆,並以8:2比例隨機分割為訓練集與測試集。技術實作上,本研究建構「雙層轉譯機制」:第一層為規則導向映射模組,運用「多對一映射」原則;第二層為編輯距離模糊比對模組,透過「注音中介轉碼」實施三軌比對(純漢字字形、語音轉碼比對規範字、語音轉碼比對已知變體)。 本系統在5,778筆測試語料中,準確率達99.83%,精確率更達99.95%。量化分析證實,第一層規則導向模組能解決98.96%的高頻慣用詞,第二層模糊比對模組則能有效召回字典未收錄之變體。質性分析發現,高達91.04%的火星文屬於「華語借音型」,反映使用者受限於輸入法而產生「視覺化台語」的行為。實驗數據進一步證實,語音特徵在轉譯中的貢獻度遠高於字形特徵。 本研究結論指出,規則導向結合統計相似度的混合模式,能為台語這類低資源語言提供具解釋力的「白盒子」文本正規化方案。本研究成果可作為大型語言模型(LLM)之前端預處理,有效降低雜訊並提升模型效能。未來建議應建立系統化的語料蒐集與自動識別機制,以促進本土語言在數位時代的持續發展。 | zh_TW |
| dc.description.abstract | This study aims to resolve challenges in information accessibility and dissemination caused by the disorganized writing forms of Taiwanese text (commonly referred to as Taiwanese Internet Slang) in digital network environments. With the proliferation of online communities, users frequently adopt non-standard strategies such as Mandarin homophonic borrowing, Bopomofo symbols, or mixed phonetic spellings to write Taiwanese, leading to high textual divergence. Using the Dictionary of Frequently-Used Taiwanese Taigi prescribed by the Ministry of Education as the standard, this study establishes an automated translation mechanism to enhance the consistency and readability of digital Taiwanese texts.
A mixed-methods research approach was adopted. A corpus was collected from platforms including PTT, Facebook, and Threads, yielding a total of 28,886 valid entries, which were then randomly split into training and test sets in an 8:2 ratio. In terms of technical implementation, a "two-layer translation mechanism" was constructed. The first layer consists of a rule-based mapping module utilizing the "many-to-one mapping" principle. The second layer is an edit distance fuzzy matching module that implements "triple-path matching" (pure Han character glyphs, phonetic transcoding against standard characters, and phonetic transcoding against known variants) through "Bopomofo-mediated transcoding." Results indicate that in a test set of 5,778 entries, the system achieved an accuracy of 99.83% and a precision of 99.95%. Quantitative analysis confirms that the first-layer rule-based module resolved 98.96% of high-frequency conventional terms, while the second-layer fuzzy matching module effectively recalled variants not included in the dictionary. Qualitative analysis revealed that 91.04% of the slang belongs to the "Mandarin homophonic borrowing" type, reflecting the practice of "visualized Taiwanese," a phenomenon consistent with eye dialect, resulting from input method constraints. Experimental data further verify that the contribution of phonetic features to translation is significantly higher than that of glyph features. The conclusion of this study indicates that a hybrid model combining rule-based mapping with statistical similarity provides an interpretable "white-box" text normalization solution for low-resource languages like Taiwanese. The research findings can serve as front-end preprocessing for Large Language Models (LLMs), effectively reducing noise and improving model performance. Future research should focus on establishing systematic corpus collection and automated identification mechanisms to promote the sustainable development of local languages in the digital era. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2026-02-11T16:21:00Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2026-02-11T16:21:00Z (GMT). No. of bitstreams: 0 | en |
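| dc.description.tableofcontents | Oral Examination Committee Certification ii; Acknowledgements iv; Abstract (Chinese) vi; Abstract (English) viii; Table of Contents x; List of Figures xii; List of Tables xiii; Chapter 1 Introduction 1; Section 1 Research Motivation 1; Section 2 Research Purpose and Research Questions 3; Section 3 Definition of Terms 4; Section 4 Research Scope and Limitations 5; Chapter 2 Literature Review 7; Section 1 Evolution of Taiwanese (Taigi) Writing Systems and Orthographic Standards 7; Section 2 Current State of Taiwanese Information Processing 11; Section 3 Computer-Mediated Communication and Non-Standard Language Phenomena 12; Section 4 Text Normalization and Automated Translation Techniques 14; Section 5 Summary 16; Chapter 3 Research Design and Implementation 17; Section 1 Research Procedure 17; Section 2 Research Subjects and Data Sources 19; Section 3 Design of the Translation Mechanism 21; Section 4 Implementation Steps 26; Section 5 Data Analysis Methods 29; Chapter 4 Results and Analysis 31; Section 1 Corpus Description and Distribution 31; Section 2 Typology of Taiwanese Internet Slang 39; Section 3 Usage Contexts of Taiwanese Internet Slang 41; Section 4 Results of Conversion Rule Construction 43; Section 5 Accuracy and Error Analysis 48; Chapter 5 Conclusions and Recommendations 57; Section 1 Conclusions 57; Section 2 Research Limitations 58; Section 3 Recommendations for Future Research 58; References 61 | - |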
| dc.language.iso | zh_TW | - |
| dc.subject | 臺灣台語 | - |
| dc.subject | 台語火星文 | - |
| dc.subject | 文本正規化 | - |
| dc.subject | 編輯距離 | - |
| dc.subject | 教育部規範用字 | - |
| dc.subject | Taiwanese (Taigi) | - |
| dc.subject | Taiwanese Internet Slang | - |
| dc.subject | Text Normalization | - |
| dc.subject | Edit Distance | - |
| dc.subject | Standard Taiwanese Characters | - |
| dc.title | 台語火星文轉譯教育部台語規範用字之研究 | zh_TW |
| dc.title | A Study on the Translation of Taiwanese (Taigi) Internet Slang into the Standard Taiwanese Characters Prescribed by the Ministry of Education | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 114-1 | - |
| dc.description.degree | Master's | - |
| dc.contributor.oralexamcommittee | 唐牧群;陳舜德 | zh_TW |
| dc.contributor.oralexamcommittee | Muh-Chyun Tang;Shun-Der Ryan Chen | en |
| dc.subject.keyword | 臺灣台語,台語火星文,文本正規化,編輯距離,教育部規範用字 | zh_TW |
| dc.subject.keyword | Taiwanese (Taigi), Taiwanese Internet Slang, Text Normalization, Edit Distance, Standard Taiwanese Characters | en |
| dc.relation.page | 63 | - |
| dc.identifier.doi | 10.6342/NTU202504781 | - |
| dc.rights.note | Authorization granted (open access worldwide) | - |
| dc.date.accepted | 2026-02-02 | - |
| dc.contributor.author-college | College of Liberal Arts | - |
| dc.contributor.author-dept | Department of Library and Information Science | - |
| dc.date.embargo-lift | 2026-02-12 | - |
| Appears in Collections: | Department of Library and Information Science | |
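
The English abstract in the record above describes a two-layer mechanism: a rule-based many-to-one mapping, followed by edit-distance fuzzy matching over three tracks (Han glyphs, phonetic transcoding against standard forms, and phonetic transcoding against known variants). The following is a minimal Python sketch of that idea, not the thesis's implementation. The rule table `RULES`, the tiny Bopomofo lookup `BOPOMOFO`, the function names, and the 0.8 similarity threshold are all assumptions chosen for illustration; the example word pairs are commonly seen borrowings and are not taken from the thesis corpus.

```python
# Illustrative sketch only: names, tables, and threshold are hypothetical.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalized to the range 0..1 (1.0 means identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


# Layer 1: many-to-one rules, non-standard variant -> standard form (toy data).
RULES = {"母湯": "毋通", "拍謝": "歹勢"}

# Toy Mandarin Bopomofo readings used as the phonetic intermediate (assumed table).
BOPOMOFO = {
    "母": "ㄇㄨˇ", "姆": "ㄇㄨˇ", "湯": "ㄊㄤ", "毋": "ㄨˊ", "通": "ㄊㄨㄥ",
    "拍": "ㄆㄞ", "謝": "ㄒㄧㄝˋ", "歹": "ㄉㄞˇ", "勢": "ㄕˋ",
}


def to_phonetic(text: str) -> str:
    """Transcode each character to Bopomofo; unknown characters pass through."""
    return "".join(BOPOMOFO.get(ch, ch) for ch in text)


def normalize(token: str, threshold: float = 0.8) -> tuple[str, str]:
    """Return (normalized form, name of the layer or track that decided)."""
    if token in RULES:                      # Layer 1: exact rule hit
        return RULES[token], "rule"

    phon = to_phonetic(token)               # Layer 2: three-track fuzzy match
    candidates = []
    for std in sorted(set(RULES.values())):
        candidates.append((similarity(token, std), std, "glyph"))
        candidates.append((similarity(phon, to_phonetic(std)), std, "phonetic-standard"))
    for variant, std in RULES.items():
        candidates.append((similarity(phon, to_phonetic(variant)), std, "phonetic-variant"))

    score, std, track = max(candidates, key=lambda c: c[0])
    return (std, track) if score >= threshold else (token, "unchanged")


print(normalize("母湯"))  # ('毋通', 'rule')
print(normalize("姆湯"))  # ('毋通', 'phonetic-variant')
print(normalize("你好"))  # ('你好', 'unchanged')
```

In this toy run, the unseen variant 姆湯 shares no glyph-level similarity with the standard form 毋通, but its Bopomofo transcoding is identical to that of the known variant 母湯, so only the phonetic track recovers it. This mirrors, in miniature, the abstract's observation that phonetic features contribute more than glyph features.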
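Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| ntu-114-1.pdf | 2.24 MB | Adobe PDF | View/Open |
Items in the system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.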
