Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91862
Title: | Effective Training and Policy Explanation of DRL for Dynamic Machine Allocation: A Case Study in Semiconductor Metal Sputtering Machine Group |
Author: | 許心慈 Hsin-Tzu Hsu |
Advisor: | 張時中 Shi-Chung Chang |
Keywords: | Semiconductor Scheduling, Machine Allocation, Deep Reinforcement Learning, Effective Training, Explainable Artificial Intelligence |
Publication Year: | 2024 |
Degree: | Master's |
Abstract: | In 2022, Taiwan's semiconductor foundry output reached USD 90 billion, accounting for 63% of global production value. Dynamic Machine Allocation (DMA) in production scheduling plays an important role in wafer fabrication. Current production line practice uses heuristic DMA methods built on engineers' knowledge and experience. Although proper management can achieve excellent production, the following issues remain: (D1) good practice relies heavily on experienced engineers; (D2) during economic transitions, manually adjusted allocation policies take 2 to 3 days of trial tuning, cannot respond to production line changes in time, and daily stage moves have deviated from target by as much as -39%; (D3) allocation policies are tacit and hard to preserve and pass on.
Although many DRL-based DMA studies have shown that production performance can be improved significantly over heuristic methods, practical deployment on production lines is still limited by: (L1) generation of DRL training data; (L2) high computational cost and long training time for DRL; (L3) decisions made by DRL do not necessarily comply with DMA constraints; (L4) DRL is a black-box model, so engineers have difficulty interpreting the learned policy. To address these issues, this study proposes the DRL-based "Effective & Explainable Policy Learning for Dynamic Machine Allocation (ELMA)" framework. To verify the effectiveness of the proposed methods, this thesis takes a real semiconductor fab's metal sputtering machine group as a case study and constructs a simulation environment for validation. The main research problems (P), corresponding challenges (C), and designed solutions (M) are as follows: (P1) DRL training data generation: how to construct a machine-group production environment simulation that meets the problem's requirements, and how to design the DRL states, actions, and reward function to generate DRL training data? (C1) The quality of the simulation environment determines whether the policy DRL learns can be applied effectively in the real world, and the design of the DRL states, actions, and reward function directly affects the information and feedback received during learning; making the simulation dynamics and the DRL design closely match the real problem is challenging. (M1) Build the machine-group environment as a discrete event simulator and, based on actual data, fit probability distribution models of random dynamic production line events to generate such events in the simulation. In addition, formulate the case-study machine allocation problem mathematically as an optimization problem, MS-MAP, and design the DRL states, actions, and reward function according to MS-MAP's variables, decision variables, and objective function. (P2) Policy learning efficiency: the feasible action space of the DMA problem is huge, so solving for a new policy directly with DRL is inefficient; how to reduce the action space while still solving the original problem? (C2) Taking MS-MAP as an example, an action is defined as allocating a particular machine to a particular processing task at a given time; the action space grows exponentially with the number of machines and can reach hundreds of millions in this case. (M2) Design a "Two-stage Effective Policy Learning Agent (TELA)" that splits the problem into two stages and solves them by combining DRL with optimization methods: Stage 1 classifies machines by processing capability and uses DRL to decide how many machines of a given type to allocate to a given processing task, which greatly shrinks the solution space of the problem DRL solves; Stage 2 uses optimization methods to decide which specific machine to allocate to which processing task based on Stage 1's result. (P3) Policy optimization with feasible actions: how to make the decisions TELA takes during policy learning and testing comply with the DMA constraints: (a) machine processing capability, (b) integer numbers of allocated machines, (c) limits on the number of machines that can be allocated? (C3) A common way to handle infeasible actions is to introduce penalty mechanisms in the reward function during policy learning, but this cannot guarantee that the agent will not violate action constraints during actual testing. (M3) Define the DRL network output as the "preference score of a machine type for a processing task" and choose a suitable continuous-action learning algorithm, Proximal Policy Optimization (PPO), to satisfy (a). To satisfy (b) and (c), design a discretization module, Rough Allocation Action Transformation (RAT), that transforms the DRL network outputs with reference to the number of currently available machines. (P4) Policy explainability: how to interpret the policy TELA learns and translate it into a knowledge representation model that engineers can clearly understand? (C4) The DRL network model is a black box; converting the learned policy into a form engineers can understand is challenging. (M4) Collect (state, action) data produced by TELA's interaction with the simulation environment, analyze TELA's policy with these data, and compare it against common allocation policies. To translate TELA's policy into tree-like rules that assist engineers' understanding, design a Policy Explanation Translator (PET) module that builds a decision tree using the (state, action) data as training data. The findings and contributions of this thesis are as follows: 1. The ELMA framework defines a DRL-based process for DMA policy learning, decision making, and human-machine collaborative explanation. ELMA is broadly applicable to general problems, not limited to the semiconductor metal sputtering machine group. 2. TELA combined with RAT guarantees the feasibility of decision actions, and the two-stage design greatly shrinks the solution space of the problem DRL solves, by 99.99% in the 16-machine case study, improving training efficiency. TELA trains in about 1 hour, so it can respond in time to daily production targets and production line changes, and a trained TELA makes a DMA decision in under 0.003 seconds, enabling real-time decisions. 3.
To train and evaluate TELA's performance, discrete event simulators of the machine-group environment are built from actual production line data for two scenarios, and TELA is compared with the empirical rule engineers refer to, Fix Constant Allocation (FCA), on the common production line KPI of daily stage moves: (S-1) a scenario familiar to engineers: fab high WIP, heavily loaded machine group, full capacity; (S-2) a scenario less familiar to engineers: fab low WIP, heavily loaded machine group, some machines shut down. Tested in (S-1), TELA improves output by ~3% over FCA with well-tuned allocation parameters; tested in (S-2), TELA learns effectively and improves output by ~20% over FCA with the old allocation parameters suited to (S-1). TELA outperforms FCA in both scenarios because its policy learning targets the specified KPI, whereas engineers tuning FCA must consider other production line factors besides the KPI. Moreover, when the scenario shifts from (S-1) to (S-2) and the TELA trained in (S-1) continues training in (S-2), its policy fits (S-2) and performs worse in (S-1); applying it to (S-1) again requires retraining. 4. Analyzing TELA's policy with the (state, action) data and using PET to reduce the number of action types and convert the policy into a decision tree that assists explanation, we find that when the reward function is designed to maximize daily stage moves, TELA's learned policy prioritizes allocating machines to step 3, which increases stage moves more efficiently, while step 3 WIP is sufficient; once step 3 WIP falls below a threshold, more machines are allocated to step 1 to balance the consumption and inflow of step 3 WIP. Compared with the common workload-based allocation, TELA's policy better fits the specified objective, and it differs from the engineers' empirical rule for maximizing daily stage moves, which first accumulates step 3 WIP during the day and then concentrates on step 3; TELA thus provides new allocation policy knowledge. Semiconductor manufacturing is an important industry in Taiwan, producing USD 90 billion in 2022 and accounting for 63% of global production value. In semiconductor manufacturing, dynamic machine allocation (DMA) is crucial. Current practice is often based on heuristic methods built upon engineers' knowledge and experience, which has the following issues: (D1) Good practices heavily rely on experienced engineers; (D2) During economic transitions, manual DMA policies need 2-3 days to adjust, unable to respond to production line changes in time, with daily stage moves deviating up to -39% from the target; (D3) DMA policies are often obscure and hard to preserve. Although many studies on DMA based on Deep Reinforcement Learning (DRL) have shown significant improvements in production efficiency compared to heuristic methods, practical application of DRL to DMA is limited by: (L1) generation of DRL training data; (L2) high computational cost and time for DRL training; (L3) decisions made by DRL may not always adhere to DMA constraints; (L4) DRL is a black-box model whose learned policies are hard for engineers to interpret. Based on these issues, this study proposes the "Effective & Explainable Policy Learning for Dynamic Machine Allocation (ELMA)" architecture.
This thesis utilizes a real semiconductor fab's Metal Sputtering Machine Group as a case study and constructs a simulation environment to verify ELMA's effectiveness. The primary research problems (P), corresponding challenges (C), and the designed solutions (M) are as follows: (P1) DRL Training Data Generation: How to construct an interactive environment simulation that reflects real production line scenarios, and design the DRL states, actions, and reward functions to generate training data? (C1) The quality of the simulation affects whether the policies DRL learns can be applied effectively in the real world. The design of states, actions, and reward functions directly influences the information and feedback received during the learning process. It is challenging to align the dynamics of the simulation environment and the design of DRL closely with the real-world problem. (M1) Construct the machine group environment as a discrete event simulator, and fit probability distributions for random dynamic events from actual data. Moreover, formulate the machine allocation problem of the case study as an optimization problem, MS-MAP, and design the DRL states, actions, and reward functions accordingly. (P2) DMA Policy Learning Efficiency: DRL is inefficient at solving DMA problems directly because of their large feasible action space. How to design an approach that reduces the DRL action space while still addressing the original problem? (C2) Taking MS-MAP as an example, an action is defined as allocating a certain machine to a specific processing task at a given moment. The size of the action space grows exponentially with the number of machines and can reach billions. (M2) Design a "Two-stage Effective Policy Learning Agent (TELA)" that divides the problem into two stages and solves it by combining DRL with optimization methods: - Stage 1.
DRL-based RMA: Classify machines according to their processing capabilities and use DRL to decide how many machines of a certain type to allocate to a specific processing task, greatly reducing the action space. - Stage 2. OPT-based DEMA: Use optimization methods to decide which specific machine to allocate to a processing task based on Stage 1's results. (P3) DMA Action Compliance and Policy Optimization: How to optimize TELA's policy and ensure that TELA's actions comply with the constraints: (a) machine-processing task capability, (b) integer numbers of machines allocated, (c) the number of available machines at decision time that can be allocated? (C3) Introducing penalty mechanisms in the reward function during policy learning is a common way to handle infeasible actions, but it does not guarantee that the agent will not violate action constraints during actual testing. (M3) Define the DRL network output as the "preference score of a certain machine type for a specific processing task" and select a suitable algorithm for continuous action learning – the Proximal Policy Optimization (PPO) method – to satisfy (a). To meet (b) and (c), design a discretization module, Rough Allocation Action Transformation (RAT), that refers to the current number of available machines to transform the output values of the DRL network. (P4) Explainable DMA Policies: How to interpret the learned allocation policy and translate it into an explainable, clear knowledge representation model for engineers? (C4) The DRL model, composed of weights and biases, is a black box. Translating its learned policies into a representation understandable to engineers is challenging. (M4) Collect (state, action) data from the interaction between TELA and the simulation environment, and use the data to analyze the policy and compare it with common allocation policies.
To translate TELA's policy into tree-like rules that help engineers understand it, we design a Policy Explanation Translator (PET) module that constructs a decision tree using the (state, action) data as training data. The findings, contributions, and values are as follows: 1. The ELMA framework defines a DRL-based DMA process for policy learning, decision making, and human-machine collaborative explanation. ELMA is applicable to general DMA problems, not limited to the Metal Sputtering Machine Group. 2. TELA ensures action compliance via RAT and significantly reduces the action space through its two-stage design, with a 99.99% reduction in a 16-machine case study, which increases training efficiency. TELA's training time is about 1 hour, enabling timely responses to daily production goals and production line changes. The trained TELA makes real-time DMA decisions in less than 0.003 seconds. 3. To train TELA and assess its performance, we create a discrete event simulator for a machine group and develop two simulation scenarios using real production line data, then compare TELA with the engineers' empirical rule – Fix Constant Allocation (FCA) – using the common production line KPI of daily stage moves: (S-1) Scenario familiar to engineers: fab high WIP, heavily loaded machine group with full capacity. (S-2) Scenario unfamiliar to engineers: fab low WIP, heavily loaded machine group with some machines shut down. In (S-1), TELA achieves ~3% more daily stage moves than FCA with well-tuned allocation parameters. In (S-2), TELA effectively learns in the new environment and improves daily stage moves by ~20% over FCA with the old allocation parameters tuned for (S-1). TELA performs better than FCA in both scenarios because TELA's policy learning targets the KPI, whereas engineers must consider production line factors other than the KPI when tuning FCA.
Furthermore, when TELA is trained initially in (S-1) and training continues in (S-2), its policy fits (S-2) and thus performs poorly in (S-1). This implies that retraining is needed if it is to be applied to (S-1) again. 4. We analyze TELA's policy with the (state, action) data and apply PET to simplify the actions and translate the policy into a decision tree that assists explanation. We find that when the reward function is to maximize daily stage moves, TELA prioritizes allocation to step 3 processing tasks, which increase stage moves most effectively, while step 3 WIP is sufficient. After step 3 WIP drops below a certain threshold, TELA allocates more machines to step 1 to balance the consumption and inflow of step 3 WIP. Compared to the common workload-based allocation policy, TELA's policy fits the production goal better. Moreover, the engineers' empirical rule for maximizing daily stage moves is to pile up step 3 WIP and then deplete it; TELA's policy differs from this rule and thus provides engineers with new allocation knowledge. |
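The discrete event simulation idea in (M1) can be illustrated with a minimal sketch. The exponential up/repair distributions and the parameter values below are assumptions for illustration only; the thesis fits its event-time distributions from real production line data.

```python
import random

def machine_uptime_fraction(horizon, mean_up, mean_repair, seed=0):
    """Alternate exponentially distributed up and repair periods for one
    machine and return the fraction of the horizon it was available.
    The distribution choice and parameters here are illustrative, not
    the thesis's fitted models."""
    rng = random.Random(seed)
    t = up = 0.0
    while t < horizon:
        run = rng.expovariate(1.0 / mean_up)         # time until next breakdown
        up += min(run, horizon - t)
        t += run
        if t < horizon:
            t += rng.expovariate(1.0 / mean_repair)  # repair duration
    return up / horizon
```

With mean_up = 100 and mean_repair = 10, the long-run availability approaches 100/110 ≈ 0.91; a scheduler driven by such a simulator sees machine downtime arrive stochastically rather than on a fixed schedule.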
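The RAT discretization in (M3) can be sketched as proportional integer rounding of preference scores. This is an illustrative reconstruction, not the thesis's exact module: the softmax normalization and largest-remainder rounding below are assumed design choices that satisfy the integer and machine-availability constraints.

```python
import numpy as np

def rough_allocation(scores, available):
    """Discretize continuous preference scores into integer machine counts.

    scores:    (n_types, n_tasks) array of DRL preference scores
    available: (n_types,) number of currently available machines per type
    Returns an integer (n_types, n_tasks) allocation whose per-type totals
    equal the available machine counts, so constraints (b) and (c) hold
    by construction rather than by reward penalties.
    """
    alloc = np.zeros_like(scores, dtype=int)
    for t in range(scores.shape[0]):
        # Softmax turns raw preference scores into allocation fractions.
        e = np.exp(scores[t] - scores[t].max())
        target = (e / e.sum()) * available[t]
        base = np.floor(target).astype(int)
        # Largest-remainder rounding hands out the leftover machines.
        leftover = int(available[t] - base.sum())
        order = np.argsort(target - base)[::-1]
        base[order[:leftover]] += 1
        alloc[t] = base
    return alloc
```

For example, scores [[2.0, 1.0, 0.0]] with 5 available machines of that type yields an integer split across the three tasks that sums to exactly 5, no matter what raw values the DRL network emits.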
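The PET idea in (M4) — fitting an interpretable decision tree to logged (state, action) pairs — can be sketched with scikit-learn. The two-feature state, the action encoding, and the threshold-style stand-in policy below are purely illustrative assumptions mimicking the reported step 3/step 1 behavior, not the thesis's data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
# Hypothetical 2-feature state [step1_wip, step3_wip] and binary action:
# 0 = favor step 1, 1 = favor step 3.
states = rng.uniform(0.0, 100.0, size=(500, 2))
# Stand-in policy mimicking the reported behavior: favor step 3 while its
# WIP stays above a threshold, otherwise favor step 1.
actions = (states[:, 1] > 30.0).astype(int)

# A shallow tree recovers the threshold rule in a form engineers can read.
tree = DecisionTreeClassifier(max_depth=2).fit(states, actions)
print(export_text(tree, feature_names=["step1_wip", "step3_wip"]))
```

The printed rules read as nested if/else conditions on the WIP features, which is the kind of human-readable policy summary PET is designed to hand back to engineers.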
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91862 |
DOI: | 10.6342/NTU202304537 |
Full-Text Access: | Authorized (open access worldwide) |
Appears in Collections: | Department of Electrical Engineering |
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-112-1.pdf | 7.98 MB | Adobe PDF | View/Open |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.