Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97847

Full metadata record (DC field [language]: value)
dc.contributor.advisor [zh_TW]: 陳光華
dc.contributor.advisor [en]: Kuang-hua Chen
dc.contributor.author [zh_TW]: 黃建智
dc.contributor.author [en]: Chien-chih Huang
dc.date.accessioned: 2025-07-18T16:09:11Z
dc.date.available: 2025-07-19
dc.date.copyright: 2025-07-18
dc.date.issued: 2025
dc.date.submitted: 2025-07-15
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97847
dc.description.abstract [zh_TW]: 本研究應用大語言模型與類別對應,建立跨國學科分類表間之自動文件分類。首先藉由新舊分類表版本間的類別對應,以擴充訓練英文文件集,達致分類效能之提升;再以對比學習將分類能力由英文文件擴展至中文文件;最後使用類別對應將分類能力拓展至臺灣所使用之分類表。分類為圖書館實務常見的書目工具,常規作法是將文件分類至單一分類表,當使用不同分類表的二個機構欲進行館藏評鑑或學術評比時,需先將文件再分類至單一分類法上方能比較機構間之文件集,類別對應為實務常見的再分類方法,本研究進一步結合大語言模型用於跨語言間的類別對應。本研究目標是建構互通模型,用於對應澳洲紐西蘭標準研究分類表(簡稱澳紐分類表)與臺灣國家科學與技術委員會學門專長分類表(簡稱國科會分類表)之部份類別。選用澳紐分類表為其具有與國科會分類表相仿之三層分類架構,且有大量已分類文件可供訓練分類器,已分類文件來源為澳大利亞地區之機構典藏系統,中文文本則源自台灣各學術單位之機構典藏系統。研究步驟首先以正規方法定義類別間之三種對應關係,分別為無關、完全對應、與部份對應,再以對應關係形成擴展文件集,用以訓練與評估用於澳洲紐西蘭標準研究分類表之分類器,過往文獻中常見之自動研究分類方法皆納入實驗,包含SVM為代表之傳統機器學習演算法、fastText為代表之靜態詞嵌入方法、BERT為代表之編碼器架構、NV-embed為代表之文件向量模型、Meta LLAMA為代表之解碼器模型,及使用對比學習微調之大語言模型,對比學習用於建立澳紐分類表之分類器,以及將英文分類能力延展至中文文本。結果顯示,深度學習模型大多優於傳統機器學習模型,比較傳統SVM及Meta LLAMA 3.1之表現,在大類層級之Macro-F1提升至少7%,中類提升至少9%,小類提升至少17%,惟原始BERT在千類類別之小類表現極弱;語言模型之參數量越多,則分類效能越佳,比較ModernBERT-large及ModernBERT-base,大類提升1.0%至2.5%,中類提升2.2%至4.5%,小類提升9.9%至11.5%;以擴展文件集訓練之模型表現較好。綜合而論,對於澳紐分類表之大類或中類層級的表現以未微調之NV-embed最佳,而在小類層級,採用對比澳紐分類表之類別方式微調之模型效能最佳,此分類器亦具有分類中文文件至澳紐分類表之能力,顯示即使未以雙語對照訓練,經英文文本對澳紐分類表做對照微調訓練,仍可提升大語言模型對中文文本的分類能力;最終,本研究展示該模型分類中文文件至國科會分類表之部份類別。本研究之分類器在實務可用於推薦澳紐分類表類別,結果顯示在推薦至多九個小類類別,可完全預測出八成文件之已標註類別。整體而言,本研究將文件向量視為文件題名及摘要文字之平均字符向量,而類別向量為類別內之平均文件向量,類別向量可用於測量類別間相近性,實務上可做為更新分類表之參考。本研究以大語言模型中的「表徵」,初探「類別」與「分類」之概念,未來將進一步探究大語言模型中「類別表徵」與「類別」之關係,期能助益實務界妥善利用人工智慧於圖書館服務。
dc.description.abstract [en]: This study applies large language models (LLMs) and class mapping to achieve automatic document classification for research classification systems. By leveraging class mapping between two versions of a classification system, the English training document set is expanded to improve classification performance. Contrastive learning is then used to propagate the classification capability from English documents to Chinese documents. Finally, class correspondence is employed to extend the classification capability to Taiwan's classification system. Classification is a common bibliographic tool in library practice, and conventionally a document is classified under a single classification system. When two institutions using different classification systems intend to conduct collection assessment or academic performance evaluation, their documents must first be reclassified into a single classification system before the institutions' document sets can be compared. Class correspondence tables are a common reclassification method in library practice; this study further combines them with large language models for cross-language class correspondence. The research objective is to construct an interoperable model for mapping selected classes between the Australian and New Zealand Standard Research Classification (ANZSRC) and Taiwan's National Science and Technology Council Academic Expertise Classification (NSTC Classification). The ANZSRC was selected because it has a three-tier classification structure similar to the NSTC Classification and offers abundant classified documents for training classifiers. The classified documents come from institutional repository systems in Australia, while the Chinese texts are sourced from institutional repository systems in Taiwan. The research procedure first formally defines three types of class mapping relations: non-mapped, possibly-mapped, and definitely-mapped. These mapping relations are then used to form expanded document sets for training and evaluating classifiers for the ANZSRC. The automatic research classification methods common in previous literature are all included in the experiments: traditional machine learning algorithms represented by SVM, static word embedding methods represented by fastText, encoder architectures represented by BERT, document vector models represented by NV-embed, decoder models represented by Meta LLAMA, and LLMs fine-tuned using contrastive learning. Contrastive learning is used both to build ANZSRC classifiers and to extend English classification capability to Chinese texts. Results show that deep learning models mostly outperform traditional machine learning models. Comparing traditional SVM with Meta LLAMA 3.1, Macro-F1 improved by at least 7% at the division level of the ANZSRC FoR, by at least 9% at the group level, and by at least 17% at the field level, though the original BERT performed extremely poorly at the field level, which contains roughly a thousand classes. The more parameters an LLM has, the better its classification performance: comparing ModernBERT-large with ModernBERT-base, the division level improves by 1.0% to 2.5%, the group level by 2.2% to 4.5%, and the field level by 9.9% to 11.5%. Models trained on expanded document sets generally perform better. In summary, at the division and group levels of the ANZSRC, NV-embed, which is not fine-tuned in this study, performs best, while at the field level, models fine-tuned by contrasting the ANZSRC field classes achieve the best performance.
This classifier can also classify Chinese documents into the ANZSRC, showing that even without bilingual alignment training, contrastive fine-tuning on English texts for the ANZSRC still improves an LLM's ability to classify Chinese texts. Finally, this study demonstrates the model's classification of Chinese documents into selected classes of the NSTC Classification. In practice, the classifier can be used to recommend ANZSRC classes: when up to nine field classes are recommended, all annotated classes are correctly predicted for 80% of the documents. To conclude, document representation and class representation are the core notions for incorporating modern LLMs into automatic research classification. A document vector is the average of the token vectors from the title and abstract text, while a class vector is the average of the vectors of the documents in that class. Class vectors can be used to measure similarity between classes and can serve as a reference for updating classification systems in practice. Further study of the relationship between 'class representations' and 'classes' is essential for practitioners applying artificial intelligence, particularly LLM techniques, to library services.
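To make the representation scheme described in the abstract concrete, the following is a minimal, illustrative Python sketch (not code from the thesis): a document vector is taken as the mean of the token vectors of the title and abstract, a class vector as the mean of its documents' vectors, and classes are recommended by cosine similarity. The toy_token_vector function is a hypothetical stand-in for a real LLM token embedding, and the class labels are fabricated rather than actual ANZSRC codes.

```python
import numpy as np

DIM = 64  # toy embedding size; real LLM embeddings are far larger


def toy_token_vector(token: str) -> np.ndarray:
    """Hypothetical stand-in for an LLM token embedding: a pseudo-random
    vector derived from the token string (stable within one run)."""
    rng = np.random.default_rng(abs(hash(token)) % (2 ** 32))
    return rng.standard_normal(DIM)


def document_vector(title: str, abstract: str) -> np.ndarray:
    """Document vector = mean of the token vectors of the title and abstract."""
    tokens = (title + " " + abstract).lower().split()
    return np.mean([toy_token_vector(t) for t in tokens], axis=0)


def class_vectors(labelled: dict) -> dict:
    """Class vector = mean of the vectors of the documents assigned to the class."""
    return {cls: np.mean(np.stack(vecs), axis=0) for cls, vecs in labelled.items()}


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def recommend(doc_vec: np.ndarray, classes: dict, k: int = 9) -> list:
    """Rank classes by cosine similarity to the document and return the top k."""
    ranked = sorted(classes, key=lambda c: cosine(doc_vec, classes[c]), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Fabricated example classes, each with two labelled documents.
    labelled = {
        "computing": [
            document_vector("Deep learning", "neural text classification with transformers"),
            document_vector("Language models", "encoder and decoder architectures"),
        ],
        "library studies": [
            document_vector("Knowledge organization", "research classification schemes"),
            document_vector("Institutional repositories", "metadata and interoperability"),
        ],
    }
    cls_vecs = class_vectors(labelled)
    new_doc = document_vector(
        "Automatic research classification",
        "classifying repository records with large language models",
    )
    print(recommend(new_doc, cls_vecs, k=2))
    # Class-to-class similarity can be read directly from the class vectors,
    # which is how the abstract suggests comparing classes within a scheme.
    print(cosine(cls_vecs["computing"], cls_vecs["library studies"]))
```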
dc.description.provenance [en]: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-07-18T16:09:11Z. No. of bitstreams: 0
dc.description.provenance [en]: Made available in DSpace on 2025-07-18T16:09:11Z (GMT). No. of bitstreams: 0
dc.description.tableofcontents:
口試委員審定書 i
摘要 iii
Abstract v
Contents ix
Figures xiii
Tables xv
Chapter 1 Introduction 1
1.1 Background 1
1.2 Motivation 3
1.3 Research Questions 4
1.4 Definition 7
1.4.1 Terminology 7
1.4.2 Notations 10
1.4.3 Model Cards 12
1.4.4 Tag of Contrastive Learning 12
Chapter 2 Literature Review 13
2.1 Research Classification 13
2.1.1 Scientific Classification and Classification of Science 13
2.1.2 Discipline, Education, and Research Classification 15
2.1.3 Knowledge Organization and Research Classification 18
2.1.4 Abduction in classification 22
2.2 Classification Scheme 25
2.2.1 ANZSRC 25
2.2.2 Academic Expertise Classification (AEC) 26
2.2.3 Comparison between ANZSRC FoR and AEC 28
2.2.4 Automatic Research Classification 36
2.3 Automatic Document Classification 38
2.3.1 Data Representation 38
2.3.2 NLP Models for Text Classification 43
2.3.3 Neural Network Models 44
2.3.4 Classification Performance and Threshold 46
2.4 Large Language Models 48
2.4.1 Deep Learning for Natural Language Processing 48
2.4.2 Foundation Model and Transformer 51
2.4.3 Decoder-to-Encoder and LoRA 55
2.4.4 Contrastive Learning and Transfer Learning 56
Chapter 3 Research Design 59
3.1 Research Architecture 59
3.2 Class Mapping 60
3.3 Data Collection 63
3.3.1 Classification Schemes 63
3.3.2 Bibliographic records 64
3.3.3 Data Cleaning 65
3.3.4 Dataset Packaging 68
3.4 Model Training 69
3.4.1 Build Language Model 70
3.4.2 Predict 71
3.4.3 Classify 75
3.4.4 Contrastive Learning 77
Chapter 4 Result 81
4.1 Datasets 81
4.1.1 Harvest 81
4.1.2 Cleaning 82
4.1.3 Dataset Profile 84
4.1.4 TAIR Dataset 85
4.2 Classification Performance 86
4.2.1 Distributional Representation 89
4.2.2 Model Architecture, Pretrained Corpora, and Finetune Dataset 93
4.2.3 Predict and Classify 98
4.2.4 Contrastive Learning 104
4.3 Crosswalks 110
4.3.1 Class Similarity in ANZSRC FoR 110
4.3.2 Similarity Across Languages 110
4.3.3 Inference with Chinese Text 111
4.3.4 Inference Across Scheme 112
4.4 Discussion 117
4.4.1 Outcomes and Research Objectives 117
4.4.2 Feature Engineering and Representation 119
4.4.3 Class, Class Mapping, Classification, and Classifiers 121
4.4.4 Abduction, Prototype, and Family Resemblance 123
Chapter 5 Conclusion 125
5.1 Summary of Finding 125
5.2 Contribution 127
5.3 Limitation 129
5.4 Suggestion and Future Study 130
References 135
dc.language.iso: en
dc.subject [zh_TW]: 自動分類
dc.subject [zh_TW]: 知識表徵
dc.subject [zh_TW]: 機構典藏
dc.subject [zh_TW]: 研究分類
dc.subject [zh_TW]: 深度學習
dc.subject [en]: Automatic classification
dc.subject [en]: Knowledge representation
dc.subject [en]: Institutional repository
dc.subject [en]: Research classification
dc.subject [en]: Deep learning
dc.title [zh_TW]: 應用深度學習於學科分類與互通框架之研究
dc.title [en]: A Study on Applying Deep Learning in Disciplines Classification and Crosswalk Framework
dc.type: Thesis
dc.date.schoolyear: 113-2
dc.description.degree: 博士
dc.contributor.oralexamcommittee [zh_TW]: 唐牧群;楊東謀;林信成;柯皓仁
dc.contributor.oralexamcommittee [en]: Muh-Chyun Tang;Tung-Mou Yang;Sinn-Cheng Lin;Hao-Ren Ke
dc.subject.keyword [zh_TW]: 自動分類,深度學習,研究分類,機構典藏,知識表徵
dc.subject.keyword [en]: Automatic classification, Deep learning, Research classification, Institutional repository, Knowledge representation
dc.relation.page: 146
dc.identifier.doi: 10.6342/NTU202501815
dc.rights.note: 同意授權(限校園內公開)
dc.date.accepted: 2025-07-17
dc.contributor.author-college: 文學院
dc.contributor.author-dept: 圖書資訊學系
dc.date.embargo-lift: 2025-07-19
Appears in Collections: 圖書資訊學系

Files in This Item:
File: ntu-113-2.pdf (6.31 MB, Adobe PDF)
Access: restricted to NTU campus IP addresses (use the VPN service for off-campus access)


Items in the system are protected by copyright, with all rights reserved, unless otherwise indicated in their copyright terms.
