Predicting Genders of Chinese Names Using Sub-Character Features: An Experiment Using CangJie Codes

Chu-Hsiang Wei; 魏取向

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/6860

Title:	Predicting Genders of Chinese Names Using Sub-Character Features: An Experiment Using CangJie Codes (應用倉頡編碼特徵於中文人名性別預測之研究)
Author:	Chu-Hsiang Wei (魏取向)
Advisor:	Hsin-Min Lu (盧信銘)
Keywords:	text mining, Chinese name, gender prediction, support vector machine, Chinese sub-character, Cangjie coding
Year of publication:	2012
Degree:	Master's
Abstract:	In daily life, our first impression of a stranger often comes from the person's name: we try to infer gender, relationships with others (for example, whether the person is a sibling of someone we know), or even appearance. Of these, gender is generally the most obvious and the least contested. We can even infer that Chinese names themselves carry gender information, and this information often provides important interpersonal clues.
This study encodes Chinese names with Cangjie codes and, using labeled gender data, trains a support vector machine (SVM) to learn the gender characteristics of Chinese characters, so that gender can be predicted from a Chinese name. We compare K-nearest neighbors (K-NN) with the SVM and try different ways of combining Cangjie codes in order to find the most accurate method.
Because some Chinese names are used by both genders, gender prediction can hardly reach 100% accuracy. In our experiments, the SVM with Cangjie 4-grams was the most accurate, reaching about 93.59% of the best achievable prediction result. We also used a questionnaire to compare human and system judgments; the difference was not statistically significant, which suggests the system judges the gender of Chinese names about as well as humans do. When applied to other data sets, such as Facebook friends' names and English names translated into Chinese, the model still exceeded 85% accuracy. Finally, we applied the model to the names of shops and listed stocks in Taiwan to examine whether different shop types or stock sectors show different gender proportions, and the results indicate that such differences do exist.
The study thus extends gender prediction from Chinese personal names to non-name Chinese text such as shop names, and finds that decomposing characters with Cangjie codes can capture certain properties of a character through its form, which broadens the possibilities for Chinese natural language processing. Beyond building a system for automatic, large-scale gender determination of names, the predicted gender attribute can serve as an additional feature in text mining and may improve the accuracy of document classification, clustering, or opinion analysis. Most importantly, the experiment indicates that Cangjie codes can describe the gender tendency of Chinese characters, opening the door to follow-up research that uses Cangjie codes to describe other attributes of Chinese characters.
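The feature step described in the abstract can be sketched roughly as follows. This is a minimal Python illustration, not the thesis's implementation: the CANGJIE table is a four-character toy subset, the function name cangjie_ngrams is hypothetical, and counting 1- to 4-grams within each character's code is only an assumption about how the "4-gram" features might be formed.

```python
from collections import Counter

# Toy lookup table (assumption: a real system would load a complete
# Cangjie table covering every character that appears in the names).
CANGJIE = {
    "明": "AB",   # 日 + 月
    "林": "DD",   # 木 + 木
    "好": "VND",  # 女 + 弓 + 木
    "娟": "VRB",  # 女 + 口 + 月
}


def cangjie_ngrams(name, table, max_n=4):
    """Count the 1..max_n-grams of Cangjie letters of each character in `name`.

    N-grams are taken within a single character's code and never span two
    characters; this is an illustrative choice, not a documented one.
    """
    feats = Counter()
    for ch in name:
        code = table.get(ch)
        if code is None:
            continue  # characters missing from the table add no features
        for n in range(1, max_n + 1):
            for i in range(len(code) - n + 1):
                feats[code[i:i + n]] += 1
    return feats


if __name__ == "__main__":
    # Both characters contain the 女 component, so "V" is counted twice.
    print(cangjie_ngrams("娟好", CANGJIE))
```

Counts like these could be mapped to a sparse vector, one dimension per n-gram, and passed to an SVM or K-NN classifier together with the gender labels. The sub-character view is what lets a shared component such as 女 (Cangjie letter V) contribute evidence across different characters.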
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/6860
Full-text authorization:	Granted (open access worldwide)
Appears in department:	Department of Information Management

Files in this item:

File	Size	Format
ntu-101-1.pdf	1.29 MB	Adobe PDF

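Items in this repository are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.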