Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91862
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 張時中 | zh_TW |
dc.contributor.advisor | Shi-Chung Chang | en |
dc.contributor.author | 許心慈 | zh_TW |
dc.contributor.author | Hsin-Tzu Hsu | en |
dc.date.accessioned | 2024-02-23T16:20:32Z | - |
dc.date.available | 2024-02-24 | - |
dc.date.copyright | 2024-02-23 | - |
dc.date.issued | 2024 | - |
dc.date.submitted | 2024-01-22 | - |
dc.identifier.citation | [AbN04] Abbeel, P., & Ng, A. Y. (2004, July). Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning (p. 1).
[ADB17] Arulkumaran, K., Deisenroth, M. P., Brundage, M., & Bharath, A. A. (2017). Deep reinforcement learning: a brief survey. IEEE Signal Processing Magazine, 34(6), 26-38.
[ASW20] Altenmüller, T., Stüker, T., Waschneck, B., Kuhnle, A., & Lanza, G. (2020). Reinforcement learning for an intelligent and autonomous production control of complex job-shops under time constraints. Production Engineering, 14, 319-328.
[Blu14] Bluman, A. (2014). Elementary Statistics: A step by step approach 9e. McGraw Hill.
[BNP15] Branke, J., Nguyen, S., Pickardt, C. W., & Zhang, M. (2015). Automated design of production scheduling heuristics: A review. IEEE Transactions on Evolutionary Computation, 20(1), 110-124.
[BPS18] Bastani, O., Pu, Y., & Solar-Lezama, A. (2018). Verifiable reinforcement learning via policy extraction. Advances in neural information processing systems, 31.
[BSS23] Beechey, D., Smith, T. M., & Şimşek, Ö. (2023, July). Explaining reinforcement learning with shapley values. In International Conference on Machine Learning (pp. 2003-2014). PMLR.
[CEL19] Coppens, Y., Efthymiadis, K., Lenaerts, T., Nowé, A., Miller, T., Weber, R., & Magazzeni, D. (2019, August). Distilling deep reinforcement learning policies in soft decision trees. In Proceedings of the IJCAI 2019 workshop on explainable artificial intelligence (pp. 1-6).
[Cha02] Chang, Y. R. (2002). A learning agent for supervisors of semiconductor tool dispatching. National Taiwan University Master Thesis. Taipei.
[ChL21] Chien, C. F., & Lan, Y. B. (2021). Agent-based approach integrating deep reinforcement learning and hybrid genetic algorithm for dynamic scheduling for Industry 3.5 smart production. Computers & Industrial Engineering, 162, 107782.
[Cho07] Chou, Y. C. (2007, October). Using capacity as a competition strategy in a manufacturing duopoly. In Proceedings of 2007 International Symposium on Semiconductor Manufacturing (pp. 1-4). IEEE.
[ChS01] Chryssolouris, G., & Subramaniam, V. (2001). Dynamic scheduling of manufacturing job shops using genetic algorithms. Journal of Intelligent Manufacturing, 12, 281-293.
[Dev95] Devore, J. L. (1995). Probability and Statistics for Engineering and the Sciences (Vol. 5). Belmont: Duxbury Press.
[EIS20] Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Janoos, F., Rudolph, L., & Madry, A. (2020). Implementation matters in deep policy gradients: A case study on PPO and TRPO. arXiv preprint arXiv:2005.12729.
[GKU97] Geiger, C. D., Kempf, K. G., & Uzsoy, R. (1997). A tabu search approach to scheduling an automated wet etch station. Journal of Manufacturing Systems, 16(2), 102-116.
[GrT99] Grubbström, R. W., & Tang, O. (1999). Further developments on safety stocks in an MRP system applying Laplace transforms and input–output analysis. International journal of production economics, 60, 381-387.
[GuS06] Gupta, A. K., & Sivakumar, A. I. (2006). Job shop scheduling techniques in semiconductor manufacturing. The International Journal of Advanced Manufacturing Technology, 27, 1163-1169.
[GWK21] Guo, W., Wu, X., Khan, U., & Xing, X. (2021). Edge: Explaining deep reinforcement learning policies. Advances in Neural Information Processing Systems, 34, 12222-12236.
[HCC07] Hsieh, B. W., Chen, C. H., & Chang, S. C. (2007). Efficient simulation-based composition of scheduling policies by integrating ordinal optimization with design of experiment. IEEE Transactions on Automation Science and Engineering, 4(4), 553-568.
[HDY22] Huang, S., Dossa, R. F. J., Ye, C., Braga, J., Chakraborty, D., Mehta, K., & Araújo, J. G. (2022). CleanRL: High-quality single-file implementations of deep reinforcement learning algorithms. Journal of Machine Learning Research, 23(274), 1-18.
[ILL18] Iyer, R., Li, Y., Li, H., Lewis, M., Sundar, R., & Sycara, K. (2018, December). Transparency and explanation in deep reinforcement learning neural networks. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society (pp. 144-150).
[JKF19] Juozapaitis, Z., Koul, A., Fern, A., Erwig, M., & Doshi-Velez, F. (2019, January). Explainable reinforcement learning via reward decomposition. In IJCAI/ECAI Workshop on explainable artificial intelligence.
[Kao18] Kao, Y. T. (2019). Integrated short-interval scheduling for productivity and quality. National Taiwan University Doctoral Dissertation. Taipei.
[KKT21] Kuhnle, A., Kaiser, J. P., Theiß, F., Stricker, N., & Lanza, G. (2021). Designing an adaptive production control system using reinforcement learning. Journal of Intelligent Manufacturing, 32, 855-876.
[KMS22] Kuhnle, A., May, M. C., Schäfer, L., & Lanza, G. (2022). Explainable reinforcement learning in production control of job shop manufacturing system. International Journal of Production Research, 60(19), 5812-5834.
[KSS19] Kuhnle, A., Schäfer, L., Stricker, N., & Lanza, G. (2019). Design, implementation and evaluation of reinforcement learning for an adaptive order dispatching in job shop manufacturing systems. Procedia CIRP, 81, 234-239.
[KZC11] Kao, Y. T., Zhan, S. C., Chang, S. C., Ho, J. H., Wang, P., Luh, P. B., ... & Chang, J. (2011, August). Near optimal furnace tool allocation with batching and waiting time constraints. In Proceedings of 2011 IEEE International Conference on Automation Science and Engineering (pp. 108-113). IEEE.
[LCC02] Lin, Y. T., Chang, S. C., Chien, M., Chang, C. H., Hsu, K. H., Chen, H. R., & Hsieh, B. W. (2002, October). Design and implementation of a knowledge representation model for tool dispatching. In Proceedings of International Symposium on Semiconductor Manufacturing (pp. 109-112).
[LiC00] Liu, C. Y., & Chang, S. C. (2000). Scheduling flexible flow shops with sequence-dependent setup effects. IEEE Transactions on Robotics and Automation, 16(4), 408-419.
[Lin02] Lin, Y. T. (2002). Design and implementation of a dispatching knowledge representation model for semiconductor tools. National Taiwan University Master Thesis. Taipei.
[MLW23] Cost-complexity pruning. (2023). ML Wiki. http://mlwiki.org/index.php/Cost-Complexity_Pruning
[MSP10] Metan, G., Sabuncuoglu, I., & Pierreval, H. (2010). Real time selection of scheduling rules and knowledge extraction via dynamically controlled data mining. International Journal of Production Research, 48(23), 6909-6938.
[Ope18] Part 2: kinds of RL algorithms. (2018). OpenAI Spinning Up. https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html#citations-below
[OuP09] Ouelhadj, D., & Petrovic, S. (2009). A survey of dynamic scheduling in manufacturing systems. Journal of scheduling, 12, 417-431.
[PaP21] Park, I. B., & Park, J. (2021). Scalable scheduling of semiconductor packaging facilities using deep reinforcement learning. IEEE Transactions on Cybernetics.
[PGP14] Priore, P., Gomez, A., Pino, R., & Rosillo, R. (2014). Dynamic scheduling of manufacturing systems using machine learning: An updated review. Ai Edam, 28(1), 83-97.
[PHK19] Park, I. B., Huh, J., Kim, J., & Park, J. (2019). A reinforcement learning approach to robust scheduling of semiconductor manufacturing facilities. IEEE Transactions on Automation Science and Engineering, 17(3), 1420-1431.
[PTL18] Pardo, F., Tavakoli, A., Levdik, V., & Kormushev, P. (2018, July). Time limits in reinforcement learning. In International Conference on Machine Learning (pp. 4045-4054). PMLR.
[PVG11] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of machine Learning research, 12, 2825-2830.
[Sci23] scikit-learn. (2023). Decision Tree Cost Complexity Pruning. scikit-learn. https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html
[SKS18] Stricker, N., Kuhnle, A., Sturm, R., & Friess, S. (2018). Reinforcement learning for adaptive order dispatching in the semiconductor industry. CIRP Annals, 67(1), 511-514.
[SuB18] Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: an introduction. MIT press.
[SWD17] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
[WAB16] Waschneck, B., Altenmüller, T., Bauernhansl, T., & Kyek, A. (2016). Production Scheduling in Complex Job Shops from an Industry 4.0 Perspective: A Review and Challenges in the Semiconductor Industry. SAMI@iKNOW, 1793.
[WRB18] Waschneck, B., Reichstaller, A., Belzner, L., Altenmüller, T., Bauernhansl, T., Knapp, A., & Kyek, A. (2018). Optimization of global production scheduling with deep reinforcement learning. Procedia Cirp, 72, 1264-1269.
[Wik23a] Goodness of fit. (2023, August 13). In Wikipedia. https://en.wikipedia.org/wiki/Goodness_of_fit
[Wik23b] Akaike information criterion. (2023, November 27). In Wikipedia. https://en.wikipedia.org/wiki/Akaike_information_criterion
[Wik23c] Softmax function. (2023, July 25). In Wikipedia. https://en.wikipedia.org/wiki/Softmax_function
[Wik23d] Hungarian algorithm. (2023, July 11). In Wikipedia, the free encyclopedia. Retrieved July 31, 2023, https://en.wikipedia.org/wiki/Hungarian_algorithm
[Wik23e] Elbow method (clustering). (2023, July 29). In Wikipedia, the free encyclopedia. Retrieved January 2, 2024, https://en.wikipedia.org/wiki/Elbow_method_(clustering)
[YaG14] Yates, R. D., & Goodman, D. J. (2014). Probability and stochastic processes: a friendly introduction for electrical and computer engineers. John Wiley & Sons.
[YCL13] Yan, B., Chen, H. Y., Luh, P. B., Wang, S., & Chang, J. (2013). Litho machine scheduling with convex hull analyses. IEEE Transactions on Automation Science and Engineering, 10(4), 928-937.
[ZhW23] Zhang, B., & Wu, C. H. (2023). Joint dynamic dispatching and preventive maintenance for unrelated parallel machines with equipment health considerations. IEEE Transactions on Semiconductor Manufacturing.
[ZYC22] Zhu, Y., Yin, X., & Chen, C. (2022). Extracting decision tree from trained deep reinforcement learning in traffic signal control. IEEE Transactions on Computational Social Systems. | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91862 | - |
dc.description.abstract | 西元2022台灣半導體晶圓代工產值為900億美元,佔全球產值的63%。生產排程之動態機台配置(Dynamic Machine Allocation, DMA)在晶圓製造扮演重要地位,目前產線實務採用DMA方法為根據工程師知識與經驗所建構的啟發式方法。儘管透過適當管理能夠實現卓越生產,然而也存在以下問題:(D1)良好實務仰賴經驗豐富的工程師;(D2)景氣過渡時期,人為配置策略需2至3天調整試驗無法即時反應產線變化且一日站點產出量與目標曾高達-39%的偏離;(D3)配置策略隱晦不明難以保存傳承。
雖然目前有許多基於深度強化學習(Deep Reinforcement Learning, DRL)之DMA研究且展現生產效能相較啟發式方法能有效提升,但實際應用於產線仍受限於:(L1) DRL訓練資料產生;(L2) DRL訓練計算成本高、時間長;(L3)DRL所作決策不一定能保證遵守DMA限制;(L4) DRL為黑盒模型,工程師難以詮釋學到策略。基於上述議題,本研究提出基於DRL之高效且可解釋之DMA策略學習(Effective & Explainable Policy Learning for Dynamic Machine Allocation, ELMA)架構。為驗證所提出方法之有效性,本論文以真實半導體晶圓廠金屬濺鍍機群為研究案例,並建構模擬環境驗證。主要研究問題(P)、相應挑戰(C)及設計解決方案(M)如下: (P1) DRL訓練資料產生問題:如何建構符合問題需求之機台配置之機群生產環境模擬,並設計DRL之狀態、動作、獎勵函數以產生DRL訓練資料? (C1) 模擬環境建構品質影響DRL學到的策略能否有效應用於真實世界,而DRL之狀態、動作、獎勵函數設計直接影響學習過程接收到的訊息和反饋,如何使模擬環境動態與DRL之設計貼近真實問題是個挑戰。 (M1) 將機群環境構建為離散事件模擬器,並基於實際數據擬合產線隨機動態事件的概率分佈模型,以於模擬環境中產生動態事件。此外,將研究案例機台配置問題數學化描述成最佳化問題MS-MAP,並根據MS-MAP之變數、決策變數與目標函數分別設計DRL狀態、動作與獎勵函數。 (P2) 策略學習效率提升問題:DMA問題可行動作空間龐大,直接用DRL求解新策略將導致效率低落,如何減少動作空間而仍能解原問題? (C2) 以MS-MAP為例,其動作定義為在某時刻點配置某機台給某種加工任務,動作空間大小隨機台數增加成指數級增加,此例可達數億。 (M2) 設計「兩階段高效策略學習代理人(Two-stage Effective Policy Learning Agent, TELA)」將問題拆成兩階段並結合DRL與最佳化方法來解: 階段一:將機台依加工能力分類,利用DRL決定配置多少數量之某類機台給某加工任務,DRL所解問題之解空間大幅縮小。 階段二:根據階段一結果利用最佳化方法決定配置某機台給某加工任務。 (P3) 策略優化學習與決策動作可行問題:如何使TELA在學習優化策略與測試中所做決策動作遵守DMA限制:(a)機台加工能力、(b)配置機台數整數、(c)可配置機台數限制? (C3) 常見處理不可行動作方法為在學習策略過程中在獎勵函數設計引入懲罰機制,然而無法保證實際測試時代理人不違背動作限制。 (M3) 設計DRL網路輸出定義為「某類機台對某種加工任務的偏好分數」並選擇合適連續動作學習演算法「近端策略優化方法(Proximal Policy Optimization,PPO)」以滿足(a)。且為符合(b)與(c),設計參考當前可用機台數之離散化模組(Rough Allocation Action Transformation, RAT)轉換DRL網路輸出值。 (P4) 策略可解釋問題:如何詮釋TELA所學到的策略並轉譯成對於工程師可理解、清楚的知識表達模型? (C4) DRL網路模型為黑盒,分析將習得策略轉換為工程師能理解形式頗具挑戰。 (M4) 收集TELA和模擬環境互動所產生之多筆(狀態, 動作)資料,利用其分析TELA策略並與常見的配置的策略對照比較。為將TELA策略轉譯為樹狀規則輔助工程師理解,設計策略解釋轉譯模組(Policy Explanation Translator, PET)將(狀態, 動作)資料作為訓練資料建構決策樹。 本論文之研究發現與貢獻如下: 1. ELMA架構定義基於DRL之DMA策略學習、決策與人機協作解釋流程。ELMA具應用廣泛性能用於一般問題,不僅限應用於半導體金屬濺鍍機群。 2. TELA結合RAT可確保決策動作可行性,兩階段設計使DRL所解問題之解空間大幅縮小,以16台機器之研究案例為例可縮減99.99%,能提升訓練效率。TELA訓練時間約1小時,能及時因應日變化之日生產目標及產線改變。訓練好之TELA進行DMA決策時間<0.003秒能實現實時決策。 3. 為訓練及評估TELA效能,根據實際產線資料建構兩種情境下機群環境之離散事件模擬器,並與工程師所參照之經驗法則—常量參數配置法(Fix Constant Allocation, FCA)比較常見產線重點指標「每日站點產出量」: (S-1) 工程師熟悉情境:fab high WIP、機群高負載、產能全開 (S-2) 工程師較陌生情境:fab low WIP、機群高負載、部分停機 測試於(S-1),TELA較調校良好配置參數之FCA提升~3%產出量;於(S-2)測試,TELA能有效學習,比適用於(S-1)的舊配置參數FCA提升~20%產出量,TELA於兩種情境皆優於FCA因其策略學習特定於所設定重點指標,而實際工程師為FCA調參除重點指標外還需考量其他產線因素。此外情境從(S-1)轉換成(S-2),將(S-1)訓練之TELA於(S-2)繼續訓練,發現TELA策略擬合於(S-2)而在(S-1)表現較差,若要應用於(S-1)則需再次訓練。 4. 透過(狀態, 動作)資料分析TELA策略並利用PET簡化動作種類數再轉化為決策樹輔助解釋策略,發現當獎勵函數設計為最大化每日站點產出量,TELA習得策略為在step 3 WIP充足時優先配置給較有效率增加站點產出量的step 3,待step 3 WIP消耗到某閥值以下,則配置更多機台給step 1以平衡step 3 WIP消耗與流入。相比常見依工作量(Workload)配置更符合所設定目標,且不同於工程師以最大化每日站點產出量為目標的經驗法則: 一天中先累積step 3 WIP再集中做step 3,TELA能提供新的配置策略知識。 | zh_TW |
dc.description.abstract | Semiconductor manufacturing is an important industry in Taiwan; its wafer foundry output reached US$90 billion in 2022, accounting for 63% of global production value. In semiconductor manufacturing, dynamic machine allocation (DMA) is crucial. Current practice is often based on heuristic methods built upon engineers' knowledge and experience, which has the following issues: (D1) good practice heavily relies on experienced engineers; (D2) during economic transitions, manual DMA policies need 2-3 days to adjust and cannot respond to production line changes in time, with daily stage moves deviating from the target by as much as -39%; (D3) DMA policies are often obscure and hard to preserve and pass on.
Although many studies on DMA based on Deep Reinforcement Learning (DRL) have shown significant improvements in production efficiency compared to heuristic methods, practical application of DRL to DMA is limited by: (L1) generation of DRL training data; (L2) the high computational cost and long training time of DRL; (L3) decisions made by DRL may not always adhere to DMA constraints; (L4) DRL is a black-box model, so its learned policies are hard for engineers to interpret. To address these issues, this study proposes the "Effective & Explainable Policy Learning for Dynamic Machine Allocation (ELMA)" architecture. This thesis uses a real semiconductor fab's Metal Sputtering Machine Group as a case study and constructs a simulation environment to verify ELMA's effectiveness. The primary research problems (P), corresponding challenges (C), and designed solutions (M) are as follows:
(P1) DRL Training Data Generation: How to construct an interactive environment simulation that reflects real production line scenarios, and how to design the DRL states, actions, and reward functions to generate training data?
(C1) The quality of the simulation determines whether the policies DRL learns can be applied effectively in the real world, and the design of states, actions, and reward functions directly influences the information and feedback received during the learning process. It is challenging to align the dynamics of the simulation environment and the design of DRL closely with the real-world problem.
(M1) Construct the machine group environment as a discrete-event simulator, and fit probability distribution models of random dynamic events from actual production data so that such events can be generated in the simulation. Moreover, formulate the machine allocation problem of the case study as an optimization problem, MS-MAP, and design the DRL states, actions, and reward functions accordingly.
(P2) DMA Policy Learning Efficiency: DRL is inefficient at solving DMA problems directly because of their large feasible action space. How to design an approach that reduces the DRL action space while still addressing the original problem?
(C2) Taking MS-MAP as an example, an action is defined as allocating a certain machine to a specific processing task at a given moment. The size of the action space grows exponentially with the number of machines and can reach billions.
(M2) Design the "Two-stage Effective Policy Learning Agent (TELA)" to divide the problem into two stages and solve it by combining DRL with optimization methods:
- Stage 1, DRL-based RMA: classify machines by their processing capabilities and use DRL to decide how many machines of each type to allocate to each processing task, which greatly reduces the action space.
- Stage 2, OPT-based DEMA: based on Stage 1's results, use an optimization method to decide which specific machine to allocate to each processing task.
(P3) DMA Action Compliance and Policy Optimization: How to optimize TELA's policy while ensuring that its actions comply with the DMA constraints: (a) machine-task processing capability, (b) integer numbers of allocated machines, and (c) the number of machines available for allocation at decision time?
(C3) A common way to handle infeasible actions is to introduce penalty terms into the reward function during policy learning, but this does not guarantee that the agent will not violate the action constraints during actual testing.
(M3) Define the DRL network output as the "preference score of a machine type for a specific processing task" and select a suitable continuous-action learning algorithm, Proximal Policy Optimization (PPO), to satisfy (a). To meet (b) and (c), design a discretization module, the Rough Allocation Action Transformation (RAT), which transforms the DRL network outputs with reference to the number of currently available machines.
(P4) Explainable DMA Policies: How to interpret the learned allocation policy and translate it into a clear, explainable knowledge representation model for engineers?
(C4) The DRL model, composed of weights and biases, is a black box. Translating its learned policies into a representation understandable to engineers is challenging.
(M4) Collect (state, action) data from the interaction between TELA and the simulation environment, use the data to analyze the policy, and compare it with common allocation policies. To translate TELA's policy into tree-like rules that help engineers understand it, design a Policy Explanation Translator (PET) module that constructs a decision tree using the (state, action) data as training data.
The findings and contributions of this thesis are as follows:
1. The ELMA framework defines a DRL-based DMA process for policy learning, decision making, and human-machine collaborative explanation. ELMA is applicable to general DMA problems, not limited to the Metal Sputtering Machine Group.
2. TELA ensures action feasibility through RAT; its two-stage design greatly reduces the action space, by 99.99% in the 16-machine case study, and improves training efficiency. TELA's training time is about 1 hour, enabling timely responses to daily production goals and production line changes, and the trained TELA makes a DMA decision in less than 0.003 seconds, enabling real-time decisions.
3. To train TELA and assess its performance, we construct a discrete-event simulator of the machine group, develop two simulation scenarios from real production line data, and compare TELA with the engineers' empirical rule, Fix Constant Allocation (FCA), on the common production line KPI of daily stage moves:
(S-1) a scenario familiar to engineers: high fab WIP, heavily loaded machine group with full capacity;
(S-2) a scenario unfamiliar to engineers: low fab WIP, heavily loaded machine group with some machines shut down.
In (S-1), TELA achieves ~3% more daily stage moves than FCA with well-tuned allocation parameters. In (S-2), TELA learns effectively in the new environment and improves daily stage moves by ~20% over FCA with the old allocation parameters tuned for (S-1). TELA outperforms FCA in both scenarios because its policy learning targets the KPI directly, whereas engineers must consider production line factors beyond the KPI when tuning FCA. Furthermore, when TELA is first trained in (S-1) and then continues training in (S-2), its policy fits (S-2) and performs poorly in (S-1); retraining is needed before it can be applied to (S-1) again.
4. We analyze TELA's policy from the (state, action) data and apply PET to simplify the action types and translate the policy into a decision tree that assists explanation. We find that when the reward function is set to maximize daily stage moves, TELA prioritizes allocation to step 3 processing tasks, which increase stage moves more efficiently, as long as step 3 WIP is sufficient. After step 3 WIP drops below a certain threshold, TELA allocates more machines to step 1 to balance the consumption and inflow of step 3 WIP. Compared with the common workload-based allocation policy, TELA's policy fits the production goal better. It also differs from the engineers' empirical rule for maximizing daily stage moves, which is to pile up step 3 WIP and then deplete it, so TELA provides engineers with new allocation knowledge. | en |
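Item (M1) above describes fitting probability distribution models of random dynamic events (such as lot flow-in and machine failure) from actual data so the discrete-event simulator can generate them. The following is a minimal Python sketch of that kind of fitting step, assuming an exponential inter-arrival model, synthetic stand-in data, and a Kolmogorov-Smirnov goodness-of-fit check; the thesis's actual distributions, data, and simulator interface are not specified here.

```python
# Sketch: fit a distribution to observed lot inter-arrival times and use it
# to drive random flow-in events in a discrete-event simulator.
# The exponential model and the synthetic data are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
observed_interarrivals = rng.exponential(scale=12.0, size=500)  # stand-in for fab data (minutes)

# Fit an exponential distribution to the observed inter-arrival times.
loc, scale = stats.expon.fit(observed_interarrivals, floc=0.0)

# Goodness-of-fit check with a Kolmogorov-Smirnov test.
ks_stat, p_value = stats.kstest(observed_interarrivals, "expon", args=(loc, scale))
print(f"fitted mean inter-arrival = {scale:.2f} min, KS p-value = {p_value:.3f}")

# The fitted model can then generate the gaps between future lot arrivals.
next_arrival_gaps = stats.expon.rvs(loc=loc, scale=scale, size=10, random_state=1)
```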
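Stage 2 of TELA (OPT-based DEMA) chooses specific machines given the per-type counts decided by Stage 1. The sketch below illustrates the general idea as an assignment problem solved with the Hungarian algorithm (scipy.optimize.linear_sum_assignment), which the references list mentions; the machine names, capability sets, and cost values are hypothetical, and the thesis's actual OPT-based DEMA formulation may differ.

```python
# Sketch: given Stage-1 machine counts per task, pick specific machines via
# an assignment solver. Names, capabilities, and costs are hypothetical.
import numpy as np
from scipy.optimize import linear_sum_assignment

machines = ["M01", "M02", "M03", "M04"]          # available machines (hypothetical)
capability = {                                    # machine -> tasks it can process
    "M01": {"step1", "step3"},
    "M02": {"step1"},
    "M03": {"step3"},
    "M04": {"step1", "step3"},
}
stage1_counts = {"step1": 2, "step3": 2}          # Stage-1 (DRL) output: machines per task

# Expand each task into one "slot" per machine required by Stage 1.
slots = [task for task, n in stage1_counts.items() for _ in range(n)]

# Build a cost matrix; infeasible machine-task pairs get a large penalty so
# they are avoided whenever a feasible assignment exists.
BIG = 1e6
cost = np.array([[0.0 if slot in capability[m] else BIG for slot in slots]
                 for m in machines])

rows, cols = linear_sum_assignment(cost)          # Hungarian-style optimal assignment
for r, c in zip(rows, cols):
    if cost[r, c] < BIG:
        print(f"{machines[r]} -> {slots[c]}")
```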
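Item (M3) describes RAT, a discretization module that turns the DRL network's continuous preference scores into integer machine counts bounded by the number of currently available machines. The following is an assumed illustration of such a transformation using softmax fractions and largest-remainder rounding; it is a sketch of the idea, not necessarily the exact RAT design in the thesis.

```python
# Sketch: map continuous preference scores of one machine type over tasks to
# integer machine counts that sum to the currently available machine count.
import numpy as np

def rough_allocation(pref_scores: np.ndarray, available: int) -> np.ndarray:
    """Return integer machine counts per task that sum to `available`."""
    # Softmax turns arbitrary scores into allocation fractions.
    exp = np.exp(pref_scores - pref_scores.max())
    fractions = exp / exp.sum()

    ideal = fractions * available          # real-valued target allocation
    counts = np.floor(ideal).astype(int)   # integer part satisfies integrality

    # Hand out the remaining machines to the tasks with the largest remainders,
    # so the total never exceeds the available machine count.
    remainder = available - counts.sum()
    order = np.argsort(ideal - counts)[::-1]
    counts[order[:remainder]] += 1
    return counts

# Example: preference scores for (step1, step2, step3), 5 machines available.
print(rough_allocation(np.array([0.2, -1.0, 1.5]), available=5))  # e.g. [1 0 4]
```

The largest-remainder step is what keeps the rounded counts summing exactly to the available machine count, so constraints (b) and (c) in (P3) hold by construction rather than by reward penalties.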
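Item (M4) describes PET, which builds a decision tree from (state, action) data to translate TELA's policy into readable rules, and the references point to scikit-learn and cost-complexity pruning. A minimal sketch of that mimic-learning step follows; the feature names, the two simplified action classes, the synthetic trajectory data, and the ccp_alpha value are illustrative assumptions, not the thesis's actual PET configuration.

```python
# Sketch: fit a small, cost-complexity-pruned decision tree on (state, action)
# pairs collected from a trained agent, then print it as readable rules.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Stand-in trajectory data: states = (step1 WIP, step3 WIP); actions relabeled
# into two simplified classes, e.g. 0 = "favor step 1", 1 = "favor step 3".
states = rng.uniform(0, 100, size=(1000, 2))
actions = (states[:, 1] > 40).astype(int)   # hypothetical policy: favor step 3 while its WIP is high

tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)  # cost-complexity pruning
tree.fit(states, actions)

print(export_text(tree, feature_names=["step1_wip", "step3_wip"]))
```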
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-02-23T16:20:31Z No. of bitstreams: 0 | en |
dc.description.provenance | Made available in DSpace on 2024-02-23T16:20:32Z (GMT). No. of bitstreams: 0 | en |
dc.description.tableofcontents | Acknowledgements i
Abstract (in Chinese) iii
Abstract vi
List of Terms and Abbreviations xii
Table of Contents xiv
List of Figures xx
List of Tables xxiv
Chapter 1 Introduction 1
1.1 Motivation: Demands for Dynamic Machine Allocation Improvement 1
1.2 Literature Review 2
1.3 Scope of Thesis 7
1.4 Organization of Thesis 9
Chapter 2 DRL for DMA in a Fab: Problem Definitions 10
2.1 MA in a Fab 11
2.1.1 Role of Machine Allocation in Fab Production Line with Uncertainty 11
2.1.2 General Machine Allocation Problem 13
2.1.3 Mathematical Optimization Formulation for Case Study Metal Sputtering Machine Group 14
2.2 DRL for DMA Policy Learning 21
2.2.1 Leveraging MDP and DRL for DMA 22
2.2.2 Limitations in Current DRL Research 25
2.3 Framework Design for DRL-based Effective & Explainable Policy Learning of Dynamic Machine Allocation (ELMA) 27
2.3.1 Functionality and Design Idea of ELMA's Modules 28
2.3.2 Procedure Control and Module Interaction in ELMA 31
2.4 Research Problem Definitions and Challenges 33
2.4.1 Problem Definitions 34
2.4.2 Challenges 35
Chapter 3 DRL-based DMA Training Data Generation with Machine Group Environment Simulator 38
3.1 Interaction among DRL Agent, Environment Simulator and Policy Learning Algorithm 39
3.2 State, Action and Reward Design for DMA in Case Study Machine Group 42
3.3 Machine Group Environment Simulator Construction 47
3.3.1 Construct Machine Group Environment as Discrete-Event Simulation 48
3.3.2 System Events in DMA Simulation Environment 53
3.3.3 Stochastic Models for Uncertain Lot Flow-In and Machine Failure 56
3.4 Summary 59
Chapter 4 Two-Stage Effective DMA Policy Learning 61
4.1 Design Concept of Two-Stage Effective DMA Policy Learning 62
4.1.1 Partition Original DMA Decision Problem into Two Sub-problems 63
4.1.2 Analysis and Discussion for Two-Stage Solution 66
4.2 Two-Stage Effective Policy Learning Agent (TELA) 68
4.2.1 Overview of TELA 69
4.2.2 TELA Stage1. DRL-Based Rough Machine Allocation (DRL-Based RMA) 70
4.2.1.1 DRL Policy Network Model 73
4.2.1.2 Rough Allocation Action Transformation (RAT) 75
4.2.3 TELA Stage2. Optimization-based DEtailed Machine Allocation (OPT-based DEMA) 77
4.3 DMA Policy Learning Algorithm 81
4.4 Illustrative Example for TELA Evaluation 83
4.5 Summary 85
Chapter 5 Explanation for DRL-Based DMA 87
5.1 Selecting Explanation Approach for DRL Learned DMA Policy 88
5.1.1 Literature Survey 88
5.1.2 Approach Selection 89
5.2 Decision Tree-Based Explanation Approach and Challenges 90
5.2.1 Existing Decision Tree-Based Mimic Learning Explanation 91
5.2.2 Challenges in the Decision Tree-based Explanation 94
5.3 Proposed Decision Tree-Based Policy Explanation Translator (PET) Design to Overcome the Challenges 95
5.3.1 Overview of PET Design 96
5.3.2 Trajectory Data Collection as Training Data 97
5.3.3 Training Data Preprocessing 98
5.3.4 Learning Algorithm of PET 105
5.4 Illustrative Example for Policy Explanation Translator Evaluation 110
5.5 Summary 112
Chapter 6 Implementation and Experiment Results 114
6.1 Hardware Specifications and Software Stack for Implementing ELMA 114
6.2 Experiment Scenarios and Settings 119
6.2.1 Experiment Scenarios and Simulation Environment Parameters 120
6.2.2 Test Data Generation 122
6.2.3 Configuration of DRL Agents and Learning Parameters 124
6.2.4 Benchmarks 125
6.3 Performance Evaluation of TELA 127
6.3.1 Allocation Performance of TELA 127
6.3.1.1 [Exp1] Performance of TELA under S-1 High WIP High Load 128
6.3.1.2 [Exp2] Performance of TELA under S-2 Low WIP High Load 136
6.3.1.3 [Exp3] Different Reward Functions 142
6.3.1.4 [Exp4] Performance of TELA Trained under Scenario Transition 144
6.3.2 Training Time Discussions 148
6.4 Analysis Learned TELA's DMA Policies and Interpretation with PET 150
6.4.1 Analysis of TELA(M)'s DMA Policies 151
6.4.2 Explanation of TELA(M)'s DMA Policies with PET 160
Chapter 7 Conclusion 167
Appendix A. Configuration of Case Study Real Fab's Metal Sputtering Machine Group 171
Appendix B. Results of Illustrative Example 173
Appendix C. PPO Training Hyperparameter 174
Appendix D. Parameters of Optimal(M) in Exp1. 175
References 176 | - |
dc.language.iso | en | - |
dc.title | 高效訓練與策略可解釋之深度強化學習動態機台配置設計:以半導體金屬濺鍍機群為例 | zh_TW |
dc.title | Effective Training and Policy Explanation of DRL for Dynamic Machine Allocation: A Case Study in Semiconductor Metal Sputtering Machine Group | en |
dc.type | Thesis | - |
dc.date.schoolyear | 112-1 | - |
dc.description.degree | Master | - |
dc.contributor.oralexamcommittee | 邱哲煜;李登豐;范治民;吳政鴻 | zh_TW |
dc.contributor.oralexamcommittee | Che-Yu Chiu;Teng-Fong Li;Chi-Min Fan;Cheng-Hung Wu | en |
dc.subject.keyword | 半導體排程,機台配置,深度強化學習,高效訓練,可解釋人工智慧, | zh_TW |
dc.subject.keyword | Semiconductor Scheduling, Machine Allocation, Deep Reinforcement Learning, Effective Training, Explainable Artificial Intelligence | en |
dc.relation.page | 184 | - |
dc.identifier.doi | 10.6342/NTU202304537 | - |
dc.rights.note | Authorized (open access worldwide) | - |
dc.date.accepted | 2024-01-23 | - |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
dc.contributor.author-dept | Department of Electrical Engineering | - |
Appears in Collections: | Department of Electrical Engineering
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-112-1.pdf | 7.98 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.