Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/6381
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 項潔(Jieh Hsiang) | |
dc.contributor.author | Ruey-Cheng Chen | en |
dc.contributor.author | 陳瑞呈 | zh_TW |
dc.date.accessioned | 2021-05-16T16:27:41Z | - |
dc.date.available | 2013-02-01 | |
dc.date.available | 2021-05-16T16:27:41Z | - |
dc.date.copyright | 2013-02-01 | |
dc.date.issued | 2013 | |
dc.date.submitted | 2013-01-25 | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/6381 | - |
dc.description.abstract | 在這篇論文中,我們從機率模型的範疇內推導一個稱作「資訊保存」的數學概念。我們的方法提供了連接數個最佳化原則,例如最大熵及最小熵方法(maximum and minimum entropy methods)的基礎。在這個框架中,我們明確地假設模型推衍是一個目標針對某個參考假說的有向過程。為了檢驗這個理論,我們對無監督式斷詞(unsupervised word segmentation)以及靜態索引刪減(static index pruning)進行了詳盡的實證研究。在無監督式斷詞中,我們的方法顯著地提昇了以壓縮為基礎的方法斷詞精確度,並且在效能與效率表現上達到與目前最佳方法接近的程度。在靜態索引刪減上,我們提出的以資訊為基礎的量度(information-based measure)以比其他方法效率更好的方式達到目前最好的結果。我們的模型推衍方法也取得了新發現,像是分群分析(cluster analysis)中的新校正方法。我們期望這個對推衍原則的深度理解能產生機率模型的新方法論,並且最終邁向自然語言處理上的突破。 | zh_TW |
dc.description.abstract | In this dissertation, we motivate a mathematical concept, called information preservation, in the context of probabilistic modeling. Our approach provides a common ground for relating various optimization principles, such as maximum and minimum entropy methods. In this framework, we make explicit the assumption that model induction is a directed process toward some reference hypothesis. To verify this theory, we conducted extensive empirical studies on unsupervised word segmentation and static index pruning. In unsupervised word segmentation, our approach significantly boosted the segmentation accuracy of an ordinary compression-based method and achieved performance comparable to several state-of-the-art methods in terms of efficiency and effectiveness. For static index pruning, the proposed information-based measure achieved state-of-the-art performance, and it did so more efficiently than the other methods. Our approach to model induction has also led to new discoveries, such as a new regularization method for cluster analysis. We expect that this deepened understanding of the induction principles may produce new methodologies for probabilistic modeling and eventually lead to breakthroughs in natural language processing. | en |
dc.description.provenance | Made available in DSpace on 2021-05-16T16:27:41Z (GMT). No. of bitstreams: 1 ntu-102-D95922008-1.pdf: 825754 bytes, checksum: 7e6e1656e8c4e9fd1b69d1d09d1beb07 (MD5) Previous issue date: 2013 | en |
dc.description.tableofcontents | Acknowledgments i
Abstract iii
Abstract (in Chinese) v
List of Figures xi
List of Tables xiii
1 Introduction 1
1.1 Goals and Methodology 1
1.2 Contribution 2
1.3 Outline 3
2 Information Preservation 5
2.1 Overview 5
2.2 Formal Definition 6
2.3 Relation to Other Approaches 8
2.3.1 Principle of Maximum Entropy 8
2.3.2 Principle of Minimum Cross-Entropy 8
2.3.3 Minimum-Entropy Methods 9
2.4 Examples 10
2.4.1 Feature Selection 10
2.4.2 Regression 11
2.4.3 Cluster Analysis 14
2.5 Concluding Remarks 18
3 Unsupervised Word Segmentation 21
3.1 Overview 21
3.2 Related Work 23
3.2.1 Transitional Probability 23
3.2.2 Mutual Information 24
3.2.3 Hierarchical Bayesian Methods 25
3.2.4 Minimum Description Length Principle 26
3.3 Preliminaries 27
3.4 Regularized Compression 28
3.5 Iterative Approximation 31
3.6 Implementation 32
3.7 Evaluation 34
3.7.1 Setup 34
3.7.2 Parameter Estimation 35
3.7.3 Evaluation on Bernstein-Ratner Corpus 36
3.7.4 Evaluation on Bakeoff-2005 Corpus 37
3.7.5 Effects of Data Size 40
3.8 Concluding Remarks 42
4 Static Index Pruning 45
4.1 Overview 45
4.2 Related Work 47
4.2.1 Term-Centric and Document-Centric Methods 47
4.2.2 Recent Development 48
4.3 Information Preservation 48
4.4 Compositional Approximation 52
4.5 Experiments 54
4.5.1 Setup 54
4.5.2 Retrieval Performance 55
4.5.3 Correlation Analysis 59
4.6 Concluding Remarks 60
5 Discussion and Conclusion 61
5.1 Overview 61
5.2 General Discussion 63
5.2.1 Problem-Oriented Analysis 63
5.2.2 Relation to Principle of Maximum Cross-Entropy 64
5.2.3 Implication to Latent Variable Modeling 65
5.3 Concluding Remarks 65
Bibliography 69
A Supplementary Details 77
A.1 Representation System and Entropy 77
A.2 Rewrite-Update Procedure 79
A.3 Utility-Bias Tradeoff 84
A.3.1 Motivation 84
A.3.2 Perturbation and Utility-Bias Tradeoff 86
A.3.3 Applications 88
A.3.4 Example: Unsupervised Word Segmentation 89
A.3.5 Example: Static Index Pruning 93 | |
dc.language.iso | en | |
dc.title | 資訊保存與自然語言處理的應用 | zh_TW |
dc.title | Information preservation and its applications to natural language processing | en |
dc.type | Thesis | |
dc.date.schoolyear | 101-1 | |
dc.description.degree | Ph.D. | |
dc.contributor.oralexamcommittee | 許永真 (Jane Yung-Jen Hsu), 簡立峰 (Lee-Feng Chien), 陳信希 (Hsin-Hsi Chen), 鄭卜壬 (Pu-Jen Cheng) | |
dc.subject.keyword | 資訊理論,資訊保存,推衍原則,無監督式斷詞,靜態索引刪減,熵最佳化 | zh_TW |
dc.subject.keyword | information theory, information preservation, induction principle, unsupervised word segmentation, static index pruning, entropy optimization | en |
dc.relation.page | 96 | |
dc.rights.note | Authorized (open access worldwide) | |
dc.date.accepted | 2013-01-25 | |
dc.contributor.author-college | 電機資訊學院 | zh_TW |
dc.contributor.author-dept | 資訊工程學研究所 | zh_TW |
Appears in Collections: | Department of Computer Science and Information Engineering
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-102-1.pdf | 806.4 kB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.