NTU Theses and Dissertations Repository › College of Liberal Arts › Department of Library and Information Science
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97847
Title: A Study on Applying Deep Learning in Disciplines Classification and Crosswalk Framework
Author: Chien-chih Huang
Advisor: Kuang-hua Chen
Keywords: Automatic classification, Deep learning, Research classification, Institutional repository, Knowledge representation
Publication Year: 2025
Degree: Doctoral
Abstract: This study applies large language models (LLMs) and class mapping to achieve automatic document classification across national research classification systems. First, class mapping between two versions of a classification system is used to expand the English training document set and improve classification performance; contrastive learning then propagates the classification capability from English documents to Chinese documents; finally, class correspondence extends the classification capability to the classification system used in Taiwan.

Classification is a common bibliographic tool in library practice, and conventionally a document is classified into a single classification system. When two institutions using different classification systems wish to conduct collection evaluation or academic performance comparison, documents must first be reclassified into a single system before the institutions' document sets can be compared. Class correspondence tables are a common reclassification method in practice; this study further combines them with large language models for cross-language class correspondence. The research objective is to construct an interoperable model for mapping selected classes between the Australian and New Zealand Standard Research Classification (ANZSRC) and Taiwan's National Science and Technology Council Academic Expertise Classification (NSTC Classification). The ANZSRC was selected because it has a three-tier structure similar to the NSTC Classification and offers abundant classified documents for training classifiers; the classified documents come from institutional repositories in Australia, while the Chinese texts come from institutional repositories of academic institutions in Taiwan.

The research procedure first formally defines three types of class mapping relations: non-mapped, possibly-mapped, and definitely-mapped. These relations are then used to form expanded document sets for training and evaluating ANZSRC classifiers. Automatic research classification methods common in previous literature are all included in the experiments: traditional machine learning algorithms represented by SVM, static word embeddings represented by fastText, encoder architectures represented by BERT, document embedding models represented by NV-Embed, decoder models represented by Meta LLaMA, and LLMs fine-tuned with contrastive learning. Contrastive learning is used both to build ANZSRC classifiers and to extend English classification capability to Chinese texts.

Results show that deep learning models mostly outperform traditional machine learning models. Comparing SVM against Meta LLaMA 3.1, Macro-F1 improves by at least 7% at the division level of the ANZSRC Fields of Research, at least 9% at the group level, and at least 17% at the field level, although the original BERT performs extremely poorly at the field level, which contains over a thousand classes. The more parameters an LLM has, the better its classification performance: comparing ModernBERT-large with ModernBERT-base, the division level improves by 1.0% to 2.5%, the group level by 2.2% to 4.5%, and the field level by 9.9% to 11.5%. Models trained on the expanded document sets generally perform better. Overall, at the division and group levels of the ANZSRC, NV-Embed (not fine-tuned in this study) performs best, while at the field level, models fine-tuned by contrasting ANZSRC field classes achieve the best performance. This classifier can also classify Chinese documents into the ANZSRC, showing that even without bilingual alignment training, contrastive fine-tuning on English texts against the ANZSRC still improves an LLM's classification ability on Chinese texts. Finally, the study demonstrates the model's classification of Chinese documents into selected classes of the NSTC Classification.

In practice, the classifiers can be used to recommend ANZSRC classes: when recommending up to nine field classes, the annotated classes of 80% of documents are completely predicted. To conclude, document representation and class representation are the core notions for incorporating modern LLMs into automatic research classification: a document vector is the average of the token vectors of the title and abstract text, while a class vector is the average of the vectors of the documents in that class. Class vectors can be used to measure similarity between classes and can serve as a reference for updating classification systems in practice. This study uses the notion of "representation" in LLMs to make an initial exploration of the concepts of "class" and "classification"; future studies on the relationship between class representations and classes are essential for practitioners applying artificial intelligence, particularly LLM techniques, to library services.
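The vector formulation in the abstract's closing paragraph (a document vector is the mean of the title-and-abstract token vectors, a class vector is the mean of its in-class document vectors, and classes are recommended by similarity ranking) can be sketched as follows. This is a minimal illustration with tiny hand-written 3-dimensional "token embeddings" standing in for real LLM token vectors; the vocabulary, class labels, and function names are hypothetical and not taken from the study.

```python
import numpy as np

# Hand-written toy "token embeddings" (stand-ins for real LLM token vectors).
VOCAB = {
    "deep":           np.array([1.0, 0.0, 0.0]),
    "learning":       np.array([1.0, 1.0, 0.0]),
    "classification": np.array([1.0, 0.0, 1.0]),
    "library":        np.array([0.0, 1.0, 0.0]),
    "catalog":        np.array([0.0, 1.0, 1.0]),
    "retrieval":      np.array([0.0, 0.0, 1.0]),
}

def document_vector(tokens):
    """Document vector = mean of the token vectors of the title + abstract."""
    return np.mean([VOCAB[t] for t in tokens], axis=0)

def class_vector(docs):
    """Class vector = mean of the vectors of the documents in the class."""
    return np.mean([document_vector(d) for d in docs], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def recommend(tokens, class_vecs, k=9):
    """Rank classes by cosine similarity to the document; return top-k labels."""
    v = document_vector(tokens)
    return sorted(class_vecs, key=lambda c: cosine(v, class_vecs[c]),
                  reverse=True)[:k]

# Two toy classes, each holding a few tokenized documents.
classes = {
    "AI":  [["deep", "learning"], ["learning", "classification"]],
    "LIS": [["library", "catalog"], ["catalog", "retrieval"]],
}
class_vecs = {c: class_vector(d) for c, d in classes.items()}
print(recommend(["deep", "classification"], class_vecs, k=1))  # ['AI']
```

With real LLM embeddings only `VOCAB` and the pooling step change; the class-vector averaging and cosine ranking stay the same, and the class vectors can also be compared with one another to gauge inter-class similarity, as the abstract suggests for maintaining a classification system.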
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97847
DOI: 10.6342/NTU202501815
Full-Text License: Authorized (campus-only access)
Electronic Full-Text Release Date: 2025-07-19
Appears in Collections: Department of Library and Information Science

Files in this item:
File: ntu-113-2.pdf (6.31 MB, Adobe PDF)
Access restricted to NTU campus IP addresses (use the VPN service from off campus)


Items in the system are protected by copyright, with all rights reserved, unless otherwise indicated.

Contact Information
No. 1, Sec. 4, Roosevelt Rd., Taipei 10617, Taiwan (R.O.C.)
Tel: (02)33662353
Email: ntuetds@ntu.edu.tw
© NTU Library All Rights Reserved