Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/85559

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 唐牧群(Muh-Chyun Tang) | |
| dc.contributor.author | Li-Ting Hung | en |
| dc.contributor.author | 洪莉婷 | zh_TW |
| dc.date.accessioned | 2023-03-19T23:18:35Z | - |
| dc.date.copyright | 2022-07-19 | |
| dc.date.issued | 2022 | |
| dc.date.submitted | 2022-07-04 | |
| dc.identifier.citation | Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E. D., Gutierrez, J. B., & Kochut, K. (2017). A brief survey of text mining: Classification, clustering and extraction techniques. https://arxiv.org/pdf/1707.02919.pdf Bedi, P., & Sharma, C. (2016). Community detection in social networks. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 6(3), 115-135. https://doi.org/10.1002/widm.1178 Blei, D. M., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022. https://doi.org/10.1162/jmlr.2003.3.4-5.993 Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008. Buntine, W. (2009). Estimating likelihoods for topic models. In Z.-H. Zhou & T. Washio (Eds.), Advances in Machine Learning. ACML 2009. Lecture Notes in Computer Science (Vol. 5828, pp. 51-64). Springer. https://doi.org/10.1007/978-3-642-05224-8_6 Calheiros, A. C., Moro, S., & Rita, P. (2017). Sentiment classification of consumer-generated online reviews using topic modeling. Journal of Hospitality Marketing & Management, 26(7), 675-693. Callon, M., Law, J., & Rip, A. (1986). Qualitative scientometrics. In M. Callon, J. Law, & A. Rip (Eds.), Mapping the Dynamics of Science and Technology. Palgrave Macmillan. https://doi.org/10.1007/978-1-349-07408-2_7 Chang, J., Boyd-Graber, J., Gerrish, S., Wang, C., & Blei, D. (2009). Reading tea leaves: How humans interpret topic models. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), Advances in Neural Information Processing Systems, 22, 288-296. Curran Associates. https://doi.org/10.5555/2984093.2984126 Courtial, J. P. (1986). Technical issues and developments in methodology. In M. Callon, J. Law, & A. Rip (Eds.), Mapping the Dynamics of Science and Technology. Palgrave Macmillan.
https://doi.org/10.1007/978-1-349-07408-2_11 Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391-407. DiMaggio, P., Nag, M., & Blei, D. (2013). Exploiting affinities between topic modeling and the sociological perspective on culture: Application to newspaper coverage of US government arts funding. Poetics, 41(6), 570-606. Ding, W., & Chen, C. (2014). Dynamic topic detection and tracking: A comparison of HDP, C-word, and cocitation methods. Journal of the Association for Information Science and Technology, 65(10), 2084-2097. Furnas, G. W., Deerwester, S., Dumais, S. T., Landauer, T. K., Harshman, R. A., Streeter, L. A., & Lochbaum, K. E. (1988). Information retrieval using a singular value decomposition model of latent semantic structure. In Y. Chiaramella (Ed.), Proceedings of the 11th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 465-480). Association for Computing Machinery. Gao, W., Li, P., & Darwish, K. (2018). Joint topic modeling for event summarization across news and social media streams. In K.-F. Wang, W. Gao, R. Xu, & W. Li (Eds.), Social Media Content Analysis: Natural Language Processing and Beyond (pp. 321-346). World Scientific. https://doi.org/10.1142/9789813223615_0022 Gensim. (n.d.). What is Gensim? Retrieved December, 2021, from https://radimrehurek.com/gensim/intro.html Gupta, V. (2020). TED-Scraper. GitHub. Retrieved December, 2021, from https://github.com/The-Gupta/TED-Scraper He, Q. (1999). Knowledge discovery through co-word analysis. Library Trends, 48(1), 133-159. Hofmann, T. (1999). Probabilistic latent semantic indexing. In F. C. Gey, M. A. Hearst, & R. Tong (Chairs), Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 50-57). Association for Computing Machinery.
Hong, L., & Davison, B. D. (2010). Empirical study of topic modeling in Twitter. In P. Melville, J. Leskovec, & F. Provost (Chairs), Proceedings of the First Workshop on Social Media Analytics (pp. 80-88). Association for Computing Machinery. https://doi.org/10.1145/1964858.1964870 Hotho, A., Nürnberger, A., & Paaß, G. (2005). A brief survey of text mining. LDV Forum, 20(1), 19-62. Kapadia, S. (2019). Evaluate topic models: Latent Dirichlet Allocation (LDA). Medium. Retrieved December, 2021, from https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0 Khasseh, A. A., Soheili, F., Moghaddam, H. S., & Chelak, A. M. (2017). Intellectual structure of knowledge in iMetrics: A co-word analysis. Information Processing & Management, 53(3), 705-720. https://doi.org/10.1016/j.ipm.2017.02.001 Kim, J., & Kang, P. (2018). Analyzing international collaboration and identifying core topics for the “internet of things” based on network analysis and topic modeling. International Journal of Industrial Engineering, 25(3). Klein, C., Clutton, P., & Polito, V. (2018). Topic modeling reveals distinct interests within an online conspiracy forum. Frontiers in Psychology, 9, 189. Lee, P. C., & Su, H. N. (2010). Investigating the structure of regional innovation system research through keyword co-occurrence and social network analysis. Innovation, 12(1), 26-40. Leydesdorff, L., & Nerghes, A. (2017). Co-word maps and topic modeling: A comparison using small and medium-sized corpora (N < 1,000). Journal of the Association for Information Science and Technology, 68(4), 1024-1035. https://doi.org/10.1002/asi.23740 Liu, Q., Zheng, Z., Zheng, J., Chen, Q., Liu, G., Chen, S., ... & Ming, W. K. (2020). Health communication through news media during the early stage of the COVID-19 outbreak in China: Digital topic modeling approach. Journal of Medical Internet Research, 22(4), e19118.
Maier, D., Waldherr, A., Miltner, P., Wiedemann, G., Niekler, A., Keinert, A., ... & Adam, S. (2018). Applying LDA topic modeling in communication research: Toward a valid and reliable methodology. Communication Methods and Measures, 12(2/3), 93-118. Moody, J. (2004). The structure of a social science collaboration network: Disciplinary cohesion from 1963 to 1999. American Sociological Review, 69(2), 213-238. Newman, D., Lau, J. H., Grieser, K., & Baldwin, T. (2010). Automatic evaluation of topic coherence. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (pp. 100-108). Association for Computational Linguistics. Newman, M. E. J. (2008). Mathematics of networks. In S. N. Durlauf & L. E. Blume (Eds.), The New Palgrave Dictionary of Economics. Palgrave Macmillan. Pleplé, Q. (2013). Perplexity to evaluate topic models. Retrieved December, 2021, from http://qpleple.com/perplexity-to-evaluate-topic-models/ Python. (n.d.). About. Retrieved December, 2021, from https://www.python.org/about/ spaCy. (n.d.). Features. Retrieved December, 2021, from https://spacy.io/ Steyvers, M., & Griffiths, T. (2007). Probabilistic topic models. Handbook of Latent Semantic Analysis, 427(7), 424-440. Surian, D., Nguyen, D. Q., Kennedy, G., Johnson, M., Coiera, E., & Dunn, A. G. (2016). Characterizing Twitter discussions about HPV vaccines using topic modeling and community detection. Journal of Medical Internet Research, 18(8), e6045. Tang, M.-C., Teng, W., & Lin, M. (2019). Determining the critical thresholds for co-word network based on the theory of percolation transition: A case study in Buddhist studies. Journal of Documentation, 76(2), 462-483. https://doi.org/10.1108/JD-06-2019-0117 TED. (n.d. -a). TED Talks. Retrieved December, 2021, from https://www.ted.com/talks TED. (n.d. -b). History of TED. Retrieved December, 2021, from https://www.ted.com/about/our-organization/history-of-ted TED. (n.d.
-c). TEDGlobal. Retrieved March, 2022, from https://www.ted.com/attend/conferences/tedglobal TED. (n.d. -d). TEDPrize. Retrieved March, 2022, from https://www.ted.com/about/programs-initiatives/ted-prize TED. (n.d. -e). TEDTalks. Retrieved April, 2022, from https://www.ted.com/about/programs-initiatives/ted-talks TED. (n.d. -f). TEDx events. Retrieved April, 2022, from https://www.ted.com/tedx/events TED. (n.d. -g). About TED-Ed. Retrieved April, 2022, from https://www.ted.com/about/programs-initiatives/ted-ed TIOBE. (2021). TIOBE Index for September 2021. Retrieved December, 2021, from https://www.tiobe.com/tiobe-index/ Vogel, J. (2017). Distributed and connected information in the Internet. In A. J. Schuster (Ed.), Understanding Information: From the Big Bang to Big Data (pp. 153-174). Springer. https://doi.org/10.1007/978-3-319-59090-5_8 Wang, C., & Blei, D. M. (2011). Collaborative topic modeling for recommending scientific articles. In C. Apte (Chair), Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 448-456). Association for Computing Machinery. https://doi.org/10.1145/2020408.2020480 Xu, J. (2018). Topic modeling with LSA, PLSA, LDA & lda2Vec. Medium. Retrieved December, 2021, from https://medium.com/nanonets/topic-modeling-with-lsa-psla-lda-and-lda2vec-555ff65b0b05 Yan, X., Guo, J., Lan, Y., & Cheng, X. (2013). A biterm topic model for short texts. In D. Schwabe, V. Almeida, & H. Glase (Chairs), Proceedings of the 22nd International Conference on World Wide Web (pp. 1445-1456). Association for Computing Machinery. https://doi.org/10.1145/2488388.2488514 Zhao, W., Chen, J. J., Perkins, R., Liu, Z., Ge, W., Ding, Y., & Zou, W. (2015). A heuristic approach to determine an appropriate number of topics in topic modeling. BMC Bioinformatics, 16(13 Supplement), S8. https://doi.org/10.1186/1471-2105-16-s13-s8 Zuo, Y., Zhao, J., & Xu, K. (2016).
Word network topic model: A simple but general solution for short and imbalanced texts. Knowledge and Information Systems, 48(2), 379-398. | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/85559 | - |
| dc.description.abstract | 1. Purpose and Questions This study applies text mining techniques to the videos on TED Talks, choosing between two automated methods, community detection and topic modeling, according to the data type of each field, in order to explore the topics of TED Talks and, further, to observe how data fields that present different aspects of the videos agree and differ in interpreting the topics of the same data set. Besides basic descriptive information about each video, the TED Talks platform also provides full transcripts, which makes it possible to study video content quantitatively. Because different data types call for different topic extraction methods, choosing the method best able to mine each field's content is necessary to ensure that the results can be interpreted meaningfully. Moreover, since different fields describe different aspects of the videos, analyses of the same data set may yield clusterings that differ in the number of groups and in their vocabularies, depending on the field analyzed and the method paired with it. Yet precisely because the same data set is being explored, the main topics revealed by the different methods should not diverge too far; they may even complement one another thanks to the multiple data aspects mined, making the final interpretation of the topics in the data set more complete. Specifically, the research questions are: (1) conduct exploratory data analysis on the TED Talks data set to obtain an overview and a first look at the topics it may contain; (2) apply different methods to different data fields of TED Talks to extract topics and so complete the topical exploration of the data set; (3) observe the topics in the different extraction results qualitatively and quantitatively, and compute the similarities between topics. 2. Data and Methods The TED Talks video data used in this study were crawled with Python on April 8, 2021, using the code Gupta (2020) released publicly on GitHub, and the crawl was completed the same day. The fields selected for topic extraction are "related tags", "video description" (formed by merging the "talk description" and "speaker introduction" fields), and "transcript". The study first performed exploratory data analysis on several fields of the data set to obtain an overview. Topic extraction methods were then chosen according to the format of the three fields: for related tags, a co-word network was built with Gephi for community detection, and after trying different filtering strategies, the network with a higher modularity score and better interpretability was chosen as the extraction result. The video description and transcript fields, being natural language, were first preprocessed and then modeled with the Gensim package, and suitable topic models were selected with reference to their coherence scores. Finally, each topic was named according to the meaning reflected by its keywords and keyword combinations, with representative documents as an aid. To compare the three extraction results, the topics were examined qualitatively and quantitatively, and vector coordinates based on related tags were built for each topic to compute the similarities between topics. 3. Results Popular videos in the data set usually concern experiences and issues close to everyday life; their tone is relatively light, and the speakers tend to share knowledge by connecting it to the audience's everyday experience, knowledge mostly about thought, behavior, and feeling. As for tag usage, the common tags before 2000 were technology, design, and science, roughly matching the three themes TED set out to spread at its founding (TED: technology, entertainment, design); since 2000, tags representing other issues have become increasingly common, and independently organized TEDx talks and TED-Ed videos have also grown in number. Among the three extraction results, community detection on the related-tag co-word network found 13 topics, while topic modeling on the video description and transcript fields yielded 25 and 40 topics, respectively. The topics in the related tags reflect the three core themes of TED's founding, the everyday-experience knowledge observed in popular videos, and topics such as medicine, the environment, and biotechnology; two distinctive topics, TED-Ed animation and musical performance, were also detected. The topic model of the video description field extracted more topics describing video content than the related-tag field did; the transcripts, being longer and directly reflecting the content, produced topics that are more diverse in kind and more focused in meaning than those from the other two fields, both in number and in depth of content explored. In addition, the video description model shows that "D8 Psychology" and "D12 Business" have been the most prevalent topics in the past two years, while recently common topics in the transcript model include "T6 Greek and Roman Mythology", "T27 Pollution and Carbon Emissions", "T11 Individuals and the Workplace", "T37 Artificial Intelligence and Machine Learning", and "T20 Social Media and the Internet". Comparing the three results shows that: 1) with community detection, the topics extracted from the related-tag co-word network are fewer and broader in meaning, because related tags serve classification and retrieval functions on the platform and can be regarded as relatively fixed access points, like a controlled vocabulary; similarity computation shows that the topics detected from the tag field fall within the scope of the other two fields' results and do reflect video information and content. 2) The topics modeled from the video description and transcript fields are more numerous, narrower in meaning, and mostly more specific, because the objects analyzed are natural-language texts with diverse vocabularies; and because the source fields differ, they reflect, respectively, the background information of the videos and their actual content. 3) The topics in the three extraction results also shape the understanding of the data set from different angles: the related-tag topics show what TED Talks mainly talks about (discussed issues), the video-description topics show which aspects of those issues the videos care about (concerned aspects), and the transcript topics represent the true core topics of the video content (core topics). 4. Conclusion and Discussion By mining three fields at once, choosing the topic extraction method suited to each field's data type, and comparing the results, this study not only uncovers the topics in TED Talks from a viewer's perspective but also grasps the topical scope of the data set through text mining, making the exploration of the TED website's content more comprehensive and complete. | zh_TW |
| dc.description.abstract | 1. Purpose and Questions This study applies text mining techniques to explore topics in TED Talks videos. Two automated clustering methods, community detection and LDA topic modeling, were used to perform topic extraction according to the type of text data in various fields of the TED Talks records. The purpose of topic extraction is to recover, from the content characteristics of the input, a topic structure that reflects the semantic connections among the data. Automated topic extraction can be realized by a variety of unsupervised learning methods, such as cluster analysis, community detection, and topic models. The results can help users understand the topical coverage of a collection, identify similarities and differences between items, and improve the efficiency and comprehensibility of information retrieval. TED is a non-profit organization dedicated to spreading ideas in short, powerful talks. The early emphasis of the talks was on technology, entertainment, and design, but they now cover topics from all subjects and fields. TED Talks was chosen as the data collection because, in addition to the wide range of topics covered, each video record contains several fields usable for topic extraction. Although different fields provide different aspects of topic information, fields of different data types require different topic extraction methods to ensure that the results are of good quality and can be interpreted effectively. It therefore remains an empirical question whether different methods applied to different fields generate relevant or similar results, and the purpose of this study is to investigate whether the results define equivalent topic ranges for the data set while also yielding complementary topics that make the interpretation of the topic structure more complete. 2. 
Dataset Data on all 5,050 TED Talks were collected with a Python program on April 8, 2021. Topic extraction was performed on three data fields of the videos: 'related tags', 'video descriptions' (combining text from the fields 'talk description' and 'speaker: why listen'), and 'transcript'. 3. Method For topic extraction from 'related tags', a co-word network of the tags was constructed and analyzed with Gephi for community detection. To find communities that best represent the topics implied in the related tags, different filtering strategies were tried so as to obtain a community detection result with a higher modularity score and better interpretability. As for 'video descriptions' and 'transcript', since both fields contain natural-language text, data preprocessing, including removing punctuation, deleting stop words, and lemmatization, was required before applying Gensim for topic modeling. Coherence scores of models trained with different topic numbers were considered when selecting the model that best represents the topic extraction result for each field. The output of a topic extraction is a set of topics, each with its own keywords; the topics were therefore named by inferring the meaning behind their keyword combinations. Lastly, the relevance and similarities between topics from the three extraction results were examined both qualitatively and quantitatively. The qualitative analysis compares topics with similar names, and the quantitative analysis generates a vector coordinate for each topic based on related tags and then calculates the cosine similarities between them. 4. Results Community detection on the related-tag network reveals 13 topics, and the best topic models for the video description and transcript fields have 25 and 40 topics, respectively. Most topics in the LDA topic models correspond to a tag group in the community detection result.
Relevance and similarities between topics can be found either from topic names or by calculating cosine similarities; the latter shows that the topics extracted from the two natural-language fields by LDA topic models share more topics in common and also higher similarity scores. Overall, the results suggest that all three topic extraction methods unveil an equivalent topic range implied in the content of TED Talks videos. 5. Conclusion Among the three results, the groups generated by community detection express broader topic meanings, because related tags, the source data, carry not only labeling but also retrieval and classification functions on the TED Talks website; network visualization also has the advantage of giving a readily accessible overview of the topics covered. The topics discovered by the LDA topic models in the video description and transcript fields are finer-grained and more precise, as the source data are natural-language text. The keywords within topics reveal nuances between topics and thus support a more complete interpretation, and the videos most likely to appear in a topic also help clarify its core meaning. Nevertheless, slight differences remain between the two LDA results: because transcripts directly reflect video content and can be regarded as surrogates for the videos, the topics generated from them are more intuitive, making the result easier to interpret. | en |
| dc.description.provenance | Made available in DSpace on 2023-03-19T23:18:35Z (GMT). No. of bitstreams: 1 U0001-0407202213502600.pdf: 12459061 bytes, checksum: 8eb6c62298cc618f4fbbdc85b13c299d (MD5) Previous issue date: 2022 | en |
| dc.description.tableofcontents | Preface I Chinese Abstract II ABSTRACT V Chapter 1 Introduction 1 Section 1 Background and Motivation 1 Section 2 Purpose and Research Questions 4 Section 3 Definitions of Terms 6 Chapter 2 Literature Review 8 Section 1 Co-word Networks 8 Section 2 Topic Modeling 11 Section 3 Studies Comparing Topic Extraction Methods 17 Chapter 3 Research Design and Implementation 21 Section 1 Research Objects and Data Fields 21 Section 2 Research Tools 25 Section 3 Methods and Procedure 27 Section 4 Data Analysis 30 Chapter 4 Results 41 Section 1 Exploratory Data Analysis 41 Section 2 Topic Extraction Results 48 Section 3 Comparison of the Three Topic Extraction Results 61 Chapter 5 Conclusion and Discussion 77 References 84 | |
| dc.language.iso | zh-TW | |
| dc.subject | 主題萃取 | zh_TW |
| dc.subject | TED 演講 | zh_TW |
| dc.subject | 共字網絡 | zh_TW |
| dc.subject | 社群偵測 | zh_TW |
| dc.subject | LDA 主題建模 | zh_TW |
| dc.subject | Community Detection | en |
| dc.subject | TED Talks | en |
| dc.subject | Co-word Network | en |
| dc.subject | Topic Extraction | en |
| dc.subject | LDA Topic Modeling | en |
| dc.title | 使用共字網絡社群偵測與LDA主題建模技術對TED Talks進行主題萃取 | zh_TW |
| dc.title | Using co-word network community detection and LDA topic modeling to extract topics on TED Talks. | en |
| dc.type | Thesis | |
| dc.date.schoolyear | 110-2 | |
| dc.description.degree | Master | |
| dc.contributor.advisor-orcid | 唐牧群(0000-0001-7321-6927) | |
| dc.contributor.coadvisor | 林頌堅(Sung-Chien Lin) | |
| dc.contributor.oralexamcommittee | 吳怡瑾(I-Chin Wu),曾元顯(Yuen-Hsien Tseng) | |
| dc.contributor.oralexamcommittee-orcid | ,曾元顯(0000-0001-8904-7902) | |
| dc.subject.keyword | TED 演講,共字網絡,社群偵測,LDA 主題建模,主題萃取, | zh_TW |
| dc.subject.keyword | TED Talks,Co-word Network,Community Detection,LDA Topic Modeling,Topic Extraction, | en |
| dc.relation.page | 89 | |
| dc.identifier.doi | 10.6342/NTU202201269 | |
| dc.rights.note | Authorized (open access worldwide) | |
| dc.date.accepted | 2022-07-06 | |
| dc.contributor.author-college | College of Liberal Arts | zh_TW |
| dc.contributor.author-dept | Graduate Institute of Library and Information Science | zh_TW |
| dc.date.embargo-lift | 2022-07-19 | - |
| Appears in Collections: | Department of Library and Information Science | |
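The community detection step described in both abstracts above, a co-word network of related tags partitioned by a modularity-maximizing algorithm, can be sketched in a few lines of Python. This is a minimal illustration, not the thesis pipeline: the study built and filtered its network in Gephi, whereas here NetworkX's Louvain implementation stands in, and all tag lists are invented.

```python
from collections import Counter
from itertools import combinations

import networkx as nx


def build_coword_network(tag_lists, min_cooccurrence=1):
    """Build a weighted co-word network: nodes are tags, edge weights
    count how many talks each pair of tags appears in together."""
    weights = Counter()
    for tags in tag_lists:
        for a, b in combinations(sorted(set(tags)), 2):
            weights[(a, b)] += 1
    G = nx.Graph()
    for (a, b), w in weights.items():
        if w >= min_cooccurrence:  # drop weak edges (filtering strategy)
            G.add_edge(a, b, weight=w)
    return G


def detect_topics(G, seed=42):
    """Louvain community detection; returns the communities and the
    modularity score of the resulting partition."""
    communities = nx.community.louvain_communities(G, weight="weight", seed=seed)
    q = nx.community.modularity(G, communities, weight="weight")
    return communities, q


# Toy corpus of hypothetical TED-style tag lists
talks = [
    ["science", "biology", "medicine"],
    ["science", "medicine", "health"],
    ["design", "art", "creativity"],
    ["design", "creativity", "technology"],
]
G = build_coword_network(talks)
communities, q = detect_topics(G)
```

The `min_cooccurrence` threshold mimics, in the simplest possible way, the edge-filtering strategies the study compared before choosing the partition with the better modularity score and interpretability.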
Files in This Item:
| File | Size | Format | |
|---|---|---|---|
| U0001-0407202213502600.pdf | 12.17 MB | Adobe PDF | View/Open |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.
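The quantitative comparison step the abstracts describe, cosine similarity between topic vectors built from related tags, reduces to a normalized dot product. The topic labels below echo the naming scheme used in the abstracts, but every tag and count is hypothetical.

```python
import numpy as np

# Hypothetical topics represented as counts over a shared related-tag
# vocabulary: how often each tag occurs among the talks in that topic.
# Vocabulary order: ["science", "health", "design", "art", "technology"]
topic_vectors = {
    "C1 Medicine":   np.array([5, 7, 0, 0, 1], dtype=float),
    "D8 Psychology": np.array([4, 6, 0, 1, 0], dtype=float),
    "T12 Design":    np.array([0, 0, 8, 5, 3], dtype=float),
}


def cosine(u, v):
    """Cosine similarity between two tag-count vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


sim_med_psy = cosine(topic_vectors["C1 Medicine"], topic_vectors["D8 Psychology"])
sim_med_des = cosine(topic_vectors["C1 Medicine"], topic_vectors["T12 Design"])
```

A medicine-leaning topic and a psychology-leaning topic share most of their tags, so their similarity is near 1, while either is nearly orthogonal to a design topic; this is the kind of evidence the study uses to argue that the three extraction results cover an equivalent topic range.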
