Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101585

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 吳沛遠 | zh_TW |
| dc.contributor.advisor | Pei-Yuan Wu | en |
| dc.contributor.author | 王淯 | zh_TW |
| dc.contributor.author | Yu Wang | en |
| dc.date.accessioned | 2026-02-11T16:33:25Z | - |
| dc.date.available | 2026-02-12 | - |
| dc.date.copyright | 2026-02-11 | - |
| dc.date.issued | 2026 | - |
| dc.date.submitted | 2026-01-30 | - |
| dc.identifier.citation | [1] E. Akyürek, D. Schuurmans, J. Andreas, T. Ma, and D. Zhou. What learning algorithm is in-context learning? investigations with linear models. arXiv preprint arXiv:2211.15661, 2022.
[2] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013. [3] G. C. Calderón, E. Allaway, B. Haddow, and A. Birch. Generics are puzzling. Can language models find the missing piece? arXiv preprint arXiv:2412.11318, 2024. [4] Y. Chang and Y. Bisk. Language models need inductive biases to count inductively. arXiv preprint arXiv:2405.20131, 2024. [5] M. Du, F. He, N. Zou, D. Tao, and X. Hu. Shortcut learning of large language models in natural language understanding. Communications of the ACM, 67(1):110–120, 2023. [6] M. Geiger, A. Jacot, S. Spigler, F. Gabriel, L. Sagun, S. d'Ascoli, G. Biroli, C. Hongler, and M. Wyart. Scaling description of generalization with number of parameters in deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2020(2):023401, 2020. [7] R. Geirhos, J.-H. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020. [8] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020. [9] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020. [10] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022. [11] S. Kornblith, M. Norouzi, H. Lee, and G. Hinton. Similarity of neural network representations revisited. In International conference on machine learning, pages 3519–3529. PMLR, 2019. [12] M. Lamar, Y. Maron, and E. Bienenstock. Latent-descriptor clustering for unsupervised POS induction. 
In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 799–809, 2010. [13] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024. [14] S. Longpre, L. Hou, T. Vu, A. Webson, H. W. Chung, Y. Tay, D. Zhou, Q. V. Le, B. Zoph, J. Wei, and A. Roberts. The flan collection: Designing data and methods for effective instruction tuning. In International Conference on Machine Learning, pages 22631–22648. PMLR, 2023. [15] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. [16] E. Malach. Auto-regressive next-token predictors are universal learners. arXiv preprint arXiv:2309.06979, 2023. [17] R. T. McCoy, E. Pavlick, and T. Linzen. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. arXiv preprint arXiv:1902.01007, 2019. [18] J. Mehrer, C. J. Spoerer, N. Kriegeskorte, and T. C. Kietzmann. Individual differences among deep neural network models. Nature communications, 11(1):5725, 2020. [19] S. Min, X. Lyu, A. Holtzman, M. Artetxe, M. Lewis, H. Hajishirzi, and L. Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work?, 2022. [20] S. Minaee, T. Mikolov, N. Nikzad, M. Chenaghlu, R. Socher, X. Amatriain, and J. Gao. Large language models: A survey. arXiv preprint arXiv:2402.06196, 2024. [21] T. M. Mitchell. The need for biases in learning generalizations. Rutgers CS tech report CBM-TR-117, 1980. [22] B. K. Natarajan. On learning sets and functions. Machine Learning, 4(1):67–97, 1989. [23] M. Nye, A. J. Andreassen, G. Gur-Ari, H. Michalewski, J. Austin, D. Bieber, D. Dohan, A. Lewkowycz, M. Bosma, D. Luan, et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021. [24] L. Ouyang, J. 
Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022. [25] V. Papyan, X. Han, and D. L. Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652–24663, 2020. [26] O. Press, N. A. Smith, and M. Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409, 2021. [27] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. [28] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020. [29] K. Stratos, M. Collins, and D. Hsu. Unsupervised part-of-speech tagging with anchor hidden markov models. Transactions of the Association for Computational Linguistics, 4:245–257, 2016. [30] R. Tang, D. Kong, L. Huang, and H. Xue. Large language models can be lazy learners: Analyze shortcuts in in-context learning. In Findings of the Association for Computational Linguistics: ACL 2023, pages 4645–4657, 2023. [31] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html, 3(6):7, 2023. [32] V. Vakilian, S. Mahdavi, and C. Thrampoulidis. How much context does natural language actually require? an analysis using llms as statistical oracles. 
In ICML 2025 Workshop on Long-Context Foundation Models, 2025. [33] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984. [34] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. [35] J. Walsh, S. Mamidanna, B. Nye, M. Core, and D. Auerbach. Fine-tuning for better few shot prompting: An empirical comparison for short answer grading. arXiv preprint arXiv:2508.04063, 2025. [36] L. Wang, L. Hu, J. Gu, Z. Hu, Y. Wu, K. He, and J. Hopcroft. Towards understanding learning representations: To what extent do different neural networks learn the same representation. Advances in neural information processing systems, 31, 2018. [37] S. Wang, H. Fang, M. Khabsa, H. Mao, and H. Ma. Entailment as few-shot learner. arXiv preprint arXiv:2104.14690, 2021. [38] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022. [39] A. Warstadt, A. Parrish, H. Liu, A. Mohananey, W. Peng, S.-F. Wang, and S. R. Bowman. Blimp: The benchmark of linguistic minimal pairs for english. Transactions of the Association for Computational Linguistics, 8:377–392, 2020. [40] A. Warstadt, A. Singh, and S. R. Bowman. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641, 2019. [41] M. Watson, B. A. S. Hasan, and N. Al Moubayed. Agree to disagree: When deep learning models with identical architectures produce distinct explanations. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 875–884, 2022. [42] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, and W. Fedus. 
Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022. [43] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022. [44] J. C. White and R. Cotterell. Examining the inductive bias of neural language models with artificial languages. arXiv preprint arXiv:2106.01044, 2021. [45] S. M. Xie, A. Raghunathan, P. Liang, and T. Ma. An explanation of in-context learning as implicit bayesian inference. arXiv preprint arXiv:2111.02080, 2021. [46] Y. Yuan, L. Zhao, K. Zhang, G. Zheng, and Q. Liu. Do llms overcome shortcut learning? an evaluation of shortcut challenges in large language models. arXiv preprint arXiv:2410.13343, 2024. [47] T. Zhang and T. B. Hashimoto. On the inductive bias of masked language modeling: From statistical to syntactic dependencies. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5131–5146, 2021. [48] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, S. Zhang, G. Ghosh, M. Lewis, L. Zettlemoyer, and O. Levy. Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36:55006–55021, 2023. [49] D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, C. Cui, O. Bousquet, Q. Le, and E. Chi. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022. | - |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101585 | - |
| dc.description.abstract | 大型語言模型(LLMs)常常能夠泛化至超出訓練分布的資料,然而其歸納偏置(inductive bias)如何受到資料的影響仍未被充分理解。本研究從理論觀點出發,探討自迴歸模型如何根據特定資料集建構其歸納偏置。我們提出「最小上下文假說」(Minimal Context Hypothesis),主張語言模型在進行下一詞預測時,可透過辨識一組最小前綴詞(minimal contexts)來進行建模;這些上下文經由數學結構(part-of-math, POM)所誘導出的語言模式定義,能夠使模型在僅依賴這些最小上下文的情況下,忽略其餘前綴資訊而達成準確預測。為了形式化此概念,我們提出「自迴歸最小上下文搜尋機器」(Autoregressive Minimal Context Searching Machine, AMCSM),一種建立於 POM 結構之上的抽象框架,描述自迴歸模型如何在資料分布中隱式搜尋滿足條件獨立性的最小上下文集合。此理論框架下,最小上下文可視為根據資料分布導出的最小前綴子集,並反映語言模型對歸納偏置的建構方式。我們亦設計一系列合成乘法任務進行實驗,證實鏈式思考(chain-of-thought, CoT)監督可穩定此基於最小上下文的歸納偏置,進而促進模型對訓練分布之外資料的泛化能力。進一步的顯著性分析亦顯示,在推理過程中,模型的敏感性主要集中於這些最小上下文,呼應我們的理論預測。 | zh_TW |
| dc.description.abstract | Large language models (LLMs) often generalize beyond their training distributions, yet how their inductive biases are shaped by data remains poorly understood. We study this question from a theoretical perspective by formalizing how an autoregressive model may determine its inductive bias from a given dataset. We propose the Minimal Context Hypothesis, which posits that next-token prediction in LLMs can be characterized by identifying small subsets of prefix tokens, called minimal contexts, defined via a formal language pattern induced by part-of-math (POM) structures, such that conditioning on these minimal contexts alone renders the remaining prefix conditionally irrelevant. To formalize this idea, we introduce the Autoregressive Minimal Context Searching Machine (AMCSM), a conceptual abstraction grounded in POM structures that describes how an autoregressive learner may implicitly search for minimal conditioning sets determined by the data distribution. Within this framework, minimal contexts emerge as the smallest prefix subsets satisfying a conditional independence criterion for next-token prediction, thereby yielding a distribution-dependent characterization of inductive bias. We complement our theoretical formulation with controlled experiments on synthetic multiplication tasks, demonstrating that chain-of-thought (CoT) supervision stabilizes this minimal-context-based inductive bias and enables generalization to inputs outside the support of the training distribution. Saliency analyses further corroborate our theory, showing that model sensitivity concentrates on minimal contexts during intermediate reasoning steps. | en |
| dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2026-02-11T16:33:25Z No. of bitstreams: 0 | en |
| dc.description.provenance | Made available in DSpace on 2026-02-11T16:33:25Z (GMT). No. of bitstreams: 0 | en |
| dc.description.tableofcontents | Acknowledgements ... i
摘要 ... iii
Abstract ... v
Contents ... vii
List of Figures ... xi
Chapter 1 Introduction ... 1
Chapter 2 Related Works ... 5
2.1 Minimal Context in Language Modeling ... 5
2.2 Inductive Bias in LLMs ... 6
2.3 Part-of-Speech and Part-of-Math Structures in LLMs ... 6
Chapter 3 Experiments ... 9
3.1 Experimental Setup ... 10
3.2 Problem Distribution Setting ... 11
3.3 The Effect of CoT ... 12
3.4 The Effect of ICL ... 14
3.5 The Effect of Zero Padding ... 15
3.6 Summary of Experimental Findings ... 17
Chapter 4 The Minimal Context Hypothesis ... 19
4.1 Saliency Analysis of Minimal Context Digits ... 20
4.2 Role of Operator Tokens in Minimal Contexts ... 22
4.3 Zero as a Functional Padding Token in Dpad2 ... 25
4.4 Summary of the Minimal Context Hypothesis ... 26
Chapter 5 Theoretical Analysis ... 29
5.1 Non-Uniqueness of POM Structures ... 29
5.2 Language Patterns ... 32
5.3 Autoregressive Minimal Context Searching Machine (AMCSM) ... 34
Chapter 6 Discussion ... 41
Chapter 7 Limitations ... 43
Chapter 8 Conclusions ... 45
References ... 47
Appendix A — Experiments Implementation Details ... 55
A.1 GPT-2 Model Configuration ... 55
A.2 Training Details ... 56
A.3 1-Shot ICL-direct Data Format ... 56
A.4 1-Shot ICL-CoT Data Format ... 56
A.5 CoT Data Format without ICL (0-shot-CoT data format) ... 59
Appendix B — Detailed ICL Examples ... 61
B.1 Correct Three-Digit ICL Examples ... 61
B.2 Failure Examples in Subsection 3.5 ... 63
Appendix C — Detailed Examples in Section 4 ... 73
Appendix D — Mixed-Operation Arithmetic Dataset ... 77
D.1 Problem Distribution Setting ... 77
D.2 Experimental Setup ... 79
Appendix E — Examples of Language Patterns ... 81 | - |
| dc.language.iso | en | - |
| dc.subject | 鏈式思考 | - |
| dc.subject | 歸納偏置 | - |
| dc.subject | 分布外 | - |
| dc.subject | 語言模型 | - |
| dc.subject | 最小上下文 | - |
| dc.subject | Chain-of-Thought | - |
| dc.subject | Inductive Bias | - |
| dc.subject | Out-of-Distribution | - |
| dc.subject | Language Models | - |
| dc.subject | Minimal Context | - |
| dc.title | 學習語言模式而非分佈:以鏈式思考在語言模型中透過歸納偏置促進泛化 | zh_TW |
| dc.title | Learning Language Patterns, Not Distributions: Chain-of-Thought Enables Generalization via Inductive Bias in Language Models | en |
| dc.type | Thesis | - |
| dc.date.schoolyear | 114-1 | - |
| dc.description.degree | 碩士 | - |
| dc.contributor.oralexamcommittee | 李宏毅;于天立;李彥寰;黃文良 | zh_TW |
| dc.contributor.oralexamcommittee | Hung-Yi Lee;Tian-Li Yu;Yen-Huan Li;Wen-Liang Hwang | en |
| dc.subject.keyword | 鏈式思考,歸納偏置,分布外,語言模型,最小上下文 | zh_TW |
| dc.subject.keyword | Chain-of-Thought,Inductive Bias,Out-of-Distribution,Language Models,Minimal Context | en |
| dc.relation.page | 87 | - |
| dc.identifier.doi | 10.6342/NTU202600102 | - |
| dc.rights.note | 未授權 | - |
| dc.date.accepted | 2026-02-02 | - |
| dc.contributor.author-college | 電機資訊學院 | - |
| dc.contributor.author-dept | 電信工程學研究所 | - |
| dc.date.embargo-lift | N/A | - |
Appears in Collections: 電信工程學研究所
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-114-1.pdf (restricted access) | 787.01 kB | Adobe PDF |
Except where otherwise noted, items in this repository are protected by copyright, with all rights reserved.
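The abstract above describes chain-of-thought supervision on synthetic multiplication tasks, where each problem is expanded into intermediate steps so that every next-token prediction depends only on a small subset of the prefix. As a rough illustration only (the thesis's actual data format is not given in this record; the trace format and function name below are assumptions), such CoT training strings could be generated like this:

```python
# Hypothetical sketch of CoT-style supervision for synthetic multiplication:
# expand a*b into schoolbook partial products and a running sum, so each
# intermediate token depends on a small "minimal context" of prior tokens
# rather than on the entire input. The trace format is illustrative, not
# the thesis's actual dataset format.

def cot_multiplication_trace(a: int, b: int) -> str:
    """Expand a*b into partial products (one per digit of b), then the total."""
    steps = []
    total = 0
    for place, ch in enumerate(reversed(str(b))):
        digit = int(ch)
        partial = a * digit * (10 ** place)  # partial product for this digit
        total += partial
        steps.append(f"{a}*{digit}e{place}={partial}")
    steps.append(f"sum={total}")
    return f"{a}*{b}: " + " ; ".join(steps)

print(cot_multiplication_trace(123, 45))
# → 123*45: 123*5e0=615 ; 123*4e1=4920 ; sum=5535
```

Each step exposes the intermediate state explicitly, which is the property the abstract credits with stabilizing a minimal-context inductive bias.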
