Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/19454
Title: | 深度學習應用於部落格文章分類 Topic Classification of Blog Posts Using Deep Learning |
Author: | Pei-Chun Lin 林佩君 |
Advisor: | 丁肇隆 (Chao-Lung Ting) |
Keywords: | Natural language processing, Machine learning, Social network, Loss function, Word segmentation system |
Publication Year: | 2020 |
Degree: | Master's |
Abstract: | A huge volume of articles is published on the Internet every day, so accurate article categorization helps readers read and search more efficiently. According to statistics from the Pixnet website, however, nearly 50% of blog posts have no category selected by the author. This thesis proposes a custom loss function to help classify such posts by topic, so that the Pixnet back end can automatically determine the topic category of an unlabeled post. Articles are segmented with both the Jieba and the CKIP word segmentation systems. In the experiments, classification accuracy is 92.60% with Jieba and 93.35% with CKIP, which suggests that CKIP is the preferred segmenter when classifying Traditional Chinese text. After segmentation, the articles are encoded with pre-trained word vectors and fed into Long Short-Term Memory models or Convolutional Neural Networks for training. Training with the custom loss function yields an accuracy of 93.35%, compared with 92.98% for the conventional categorical_crossentropy loss function, indicating that the proposed custom loss function helps classify blog posts more accurately. |
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/19454 |
DOI: | 10.6342/NTU202003652 |
Full-Text Authorization: | Not authorized |
Appears in Collections: | Department of Engineering Science and Ocean Engineering |
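
The abstract compares the thesis's custom loss function against the conventional categorical_crossentropy baseline. This record does not describe the custom loss itself, so the following is only a minimal sketch of the baseline, written in plain Python for illustration. The function names, the three-topic setup, and the scores are assumptions and do not come from the thesis.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probs, eps=1e-12):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    # eps guards against log(0) when a predicted probability is exactly zero.
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot_target, probs))

# Hypothetical three-topic example; the scores are made up for illustration.
probs = softmax([2.0, 0.5, -1.0])
loss_correct = categorical_crossentropy([1, 0, 0], probs)  # true topic is the top-scored one
loss_wrong = categorical_crossentropy([0, 0, 1], probs)    # true topic is the lowest-scored one
```

The loss is small when the model assigns high probability to the true topic and grows as that probability shrinks, which is the behavior any replacement loss would be measured against.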
Files in This Item:
File | Size | Format | |
---|---|---|---|
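U0001-1708202001344200.pdf Currently not authorized for public access | 2.34 MB | Adobe PDF |
Items in the system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.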