  1. NTU Theses and Dissertations Repository
  2. College of Engineering
  3. Department of Engineering Science and Ocean Engineering
Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/19454
Title: 深度學習應用於部落格文章分類
Topic Classification of Blog Posts Using Deep Learning
Authors: Pei-Chun Lin
林佩君
Advisor: 丁肇隆(Chao-Lung Ting)
Keyword: Natural language processing, Machine learning, Social network, Loss function, Word segmentation system
Publication Year: 2020
Degree: Master's
Abstract: A huge number of articles are published on the Internet every day, so accurate article categories help readers read and search more efficiently. According to statistics from the Pixnet website, however, nearly 50% of blog posts have no category selected by their author. To address this problem, this thesis proposes a custom loss function that improves topic classification of such posts. The resulting classification system lets the Pixnet back end assign a topic category automatically to posts that lack one.
We segment the articles with both the Jieba and the CKIP word segmentation systems. In our experiments, classification accuracy is 92.60% with Jieba and 93.35% with CKIP, which suggests that CKIP is the preferred segmenter when classifying Traditional Chinese articles.
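To make the segmentation step concrete, here is a minimal sketch of what a word segmenter does to an unspaced Chinese string. It uses forward maximum matching over a toy dictionary, which is a deliberately simple stand-in and not the algorithm used by Jieba or CKIP; the function name `segment_fmm` and the vocabulary `toy_vocab` are hypothetical and chosen only for illustration.

```python
def segment_fmm(text, vocab, max_len=4):
    """Forward maximum matching: at each position, take the longest dictionary word.

    Illustrative stand-in only; Jieba and CKIP use more sophisticated methods.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy dictionary (assumption, not taken from the thesis).
toy_vocab = {"深度學習", "部落格", "文章", "分類", "應用"}
text = "深度學習應用於部落格文章分類"
tokens = segment_fmm(text, toy_vocab)
print(tokens)
```

The token list, rather than the raw character string, is what a downstream classifier would consume, which is why the choice of segmenter can shift classification accuracy.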
After word segmentation, the articles are encoded with pre-trained word vectors and then fed into Long Short-Term Memory (LSTM) models or Convolutional Neural Networks (CNNs) for training. Training with our custom loss function yields an accuracy of 93.35%, compared with 92.98% for the standard categorical_crossentropy loss function. We conclude that the proposed custom loss function helps classify blog articles more accurately.
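The abstract does not state the form of the custom loss function, so the sketch below only illustrates how a custom loss can be compared against categorical cross-entropy on one-hot labels. The function `weighted_crossentropy` and its `class_weights` parameter are hypothetical examples of a custom loss and are not the loss proposed in the thesis; the sample labels and predictions are made up for the demonstration.

```python
import math


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum_k y_true[k] * log(y_pred[k])."""
    total = 0.0
    for t_row, p_row in zip(y_true, y_pred):
        # Clip probabilities at eps to avoid log(0).
        total += -sum(t * math.log(max(p, eps)) for t, p in zip(t_row, p_row))
    return total / len(y_true)


def weighted_crossentropy(y_true, y_pred, class_weights, eps=1e-7):
    """Hypothetical custom loss: cross-entropy scaled by a per-class weight.

    Assumption for illustration only; NOT the loss proposed in the thesis.
    """
    total = 0.0
    for t_row, p_row in zip(y_true, y_pred):
        total += -sum(
            w * t * math.log(max(p, eps))
            for w, t, p in zip(class_weights, t_row, p_row)
        )
    return total / len(y_true)


# Made-up example: two samples, three topic classes, one-hot labels.
y_true = [[1, 0, 0], [0, 1, 0]]
y_pred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]

baseline = categorical_crossentropy(y_true, y_pred)
custom = weighted_crossentropy(y_true, y_pred, class_weights=[1.0, 2.0, 1.0])
print(baseline, custom)
```

In Keras, a callable taking `(y_true, y_pred)` can be passed as the `loss` argument of `model.compile`, which is one way a custom loss like this could replace `categorical_crossentropy` during training.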
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/19454
DOI: 10.6342/NTU202003652
Fulltext Rights: Not authorized
Appears in Collections: Department of Engineering Science and Ocean Engineering

Files in This Item:
File: U0001-1708202001344200.pdf (Restricted Access)
Size: 2.34 MB
Format: Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.