Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97512
Title: 針對現實場景的策略學習的發展與監控
Development and Monitoring of Policy Learning for Real-world Scenarios
Author: 葉佳峯 (Jia-Fong Yeh)
Advisor: 徐宏民 (Winston H. Hsu)
Keywords: Robot Learning, Embodied AI, Policy Learning, Reward Generation, Behavior Monitoring
Publication Year: 2025
Degree: Doctoral
Abstract: Policy learning is a crucial topic in robotics, aiming to develop policies that enable robots to accomplish tasks effectively. While policy learning has seen significant advances and applications in recent years, deploying it on robots in real-world scenarios still faces several challenges: real-world environments are highly variable and full of disturbances, they lack the real-time reward feedback a policy needs to assess its performance, and execution failures can raise serious safety concerns. These challenges limit further progress in policy learning. This dissertation therefore analyzes these challenges in depth and proposes solutions to accelerate the adoption of policy learning in practical applications.

To tackle the challenge of adapting to deployment environments that differ from the training ones, we investigate few-shot imitation learning, which requires adapting to new domains from only a few demonstrations. We develop a policy that handles multi-stage manipulation tasks, demonstrations of varying lengths whose key information is not temporally aligned, and configuration or appearance differences between demonstrators and agents. To this end, we design a stage-conscious attention model that identifies the robot's current stage and attends to the corresponding stage information in the demonstrations, and we adopt a demonstration-conditioned policy to learn the action mapping between expert and agent. Experiments show that our method outperforms other few-shot imitation learning approaches on both two- and three-stage tasks and is highly robust to demonstration quality.
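To make the stage-conscious attention idea concrete, here is a minimal sketch in PyTorch, assuming embedded observations and demonstration frames with per-frame stage labels; the class name, the 7-DoF action head, and every argument name are illustrative assumptions rather than the dissertation's released code. The current observation predicts a stage, the prediction masks cross-attention so the policy only reads demonstration frames from that stage, and the demonstration-conditioned head maps the fused features to an action.

import torch
import torch.nn as nn

class StageConsciousAttention(nn.Module):
    """Illustrative sketch of a stage-conscious, demonstration-conditioned policy head."""

    def __init__(self, feat_dim: int, num_stages: int, num_heads: int = 4, action_dim: int = 7):
        super().__init__()
        self.stage_head = nn.Linear(feat_dim, num_stages)        # predicts the robot's current stage
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.action_head = nn.Linear(2 * feat_dim, action_dim)   # demonstration-conditioned action

    def forward(self, obs_feat, demo_feats, demo_stage_ids):
        # obs_feat:       (B, D)    embedding of the agent's current observation
        # demo_feats:     (B, T, D) frame embeddings of the few expert demonstrations
        # demo_stage_ids: (B, T)    stage label assigned to each demonstration frame
        stage_logits = self.stage_head(obs_feat)                  # (B, S)
        stage = stage_logits.argmax(dim=-1, keepdim=True)         # (B, 1) predicted current stage
        ignore = demo_stage_ids != stage                          # mask frames from other stages
        query = obs_feat.unsqueeze(1)                             # (B, 1, D)
        ctx, _ = self.cross_attn(query, demo_feats, demo_feats, key_padding_mask=ignore)
        # Fuse the agent state with the stage-relevant demonstration context to predict an action.
        action = self.action_head(torch.cat([obs_feat, ctx.squeeze(1)], dim=-1))
        return action, stage_logits

A real implementation would additionally guard against stages that match no demonstration frame, where the fully masked attention above would be undefined.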

To address the lack of real-time rewards, we propose a hierarchical approach that decomposes long-horizon manipulation tasks and builds a reward generation model from the correlation between visual observations and task instructions. We leverage the reasoning capabilities of large language models to decompose tasks and analyze changes in the objects in the environment; after a stage detector localizes the current stage, a large vision-language model evaluates the robot's current motion and its degree of completion. We also design multiple contrastive learning objectives to aid model training. This hierarchical decomposition enables the reward generation model to provide fine-grained reward signals that give reinforcement learning methods precise information about task progress: under the same reinforcement learning framework, training with our reward model solves more challenging long-horizon manipulation tasks.
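As a rough illustration of this hierarchical pipeline, the sketch below composes three stand-in components; llm_decompose, detect_stage, and vlm_progress are hypothetical callables (assumed interfaces, not the dissertation's actual models) representing the large language model, the stage detector, and the large vision-language model. Counting completed stages plus fractional progress within the current stage is what makes the resulting reward signal fine-grained.

from typing import Callable, List

def generate_reward(
    instruction: str,
    observation: "Image",                                  # current camera frame (any image type)
    llm_decompose: Callable[[str], List[str]],             # instruction -> ordered sub-goal texts
    detect_stage: Callable[["Image", List[str]], int],     # frame + sub-goals -> current stage index
    vlm_progress: Callable[["Image", str], float],         # frame + sub-goal -> progress in [0, 1]
) -> float:
    """Return a dense reward reflecting which stage the robot is in and its progress within it."""
    stages = llm_decompose(instruction)                    # e.g. ["reach the drawer", "pull it open", ...]
    k = detect_stage(observation, stages)                  # localize the current stage
    progress = vlm_progress(observation, stages[k])        # fine-grained completion score for this stage
    # Completed stages each contribute 1; the current stage contributes its fractional progress,
    # so the reward increases monotonically with overall task progress.
    return (k + progress) / len(stages)

In use, the three callables would wrap actual model queries, and generate_reward would be invoked at every environment step to replace the missing ground-truth reward.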

Finally, to address the safety concerns a policy raises in novel environments, we study how to monitor policy behavior and ensure that it remains consistent with the demonstrated intent. We define the adaptable error detection task and design a pattern-analysis-based error detection model that classifies features extracted from the policy as normal or abnormal, and we introduce two contrastive learning objectives to strengthen its training. On the benchmarks we construct, the error detection model identifies errors promptly and precisely, achieving the best performance across seven tasks and three policies. Moreover, when various error detectors are combined with a correction policy, only our method effectively detects and corrects errors, thereby improving policy performance.
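The monitoring step can likewise be sketched as a distance test against prototypes of normal behavior; detect_error, the prototype matrix, and the 0.5 threshold are assumptions for illustration rather than the dissertation's pattern-analysis model, and the contrastive training that would produce the features and prototypes is omitted.

import numpy as np

def detect_error(rollout_feats: np.ndarray,      # (T, D) per-timestep features extracted from the policy
                 prototypes: np.ndarray,         # (K, D) embeddings of known normal behavior patterns
                 threshold: float = 0.5) -> int:
    """Return the first timestep flagged as erroneous, or -1 if the rollout looks normal."""
    # Cosine distance between each timestep feature and its nearest normal prototype.
    f = rollout_feats / (np.linalg.norm(rollout_feats, axis=1, keepdims=True) + 1e-8)
    p = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + 1e-8)
    nearest = (1.0 - f @ p.T).min(axis=1)        # (T,) distance to the closest normal pattern
    flagged = np.nonzero(nearest > threshold)[0] # timesteps that resemble no normal pattern
    return int(flagged[0]) if flagged.size else -1

Timely detection matters here because the detector's output is what triggers the correction policy before a failure becomes unsafe.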

This dissertation aims to accelerate the real-world application of policy learning. It addresses challenges in few-shot imitation learning, vision- and instruction-based reward generation, and the monitoring of erroneous policy behavior by proposing novel tasks and methods that surpass existing approaches. We also discuss promising future directions, hoping to inspire further advances and breakthroughs in policy learning research.
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/97512
DOI: 10.6342/NTU202500305
Full-Text License: Authorized (full text available on campus only)
Electronic Full-Text Release Date: 2025-07-03
Appears in Collections: Department of Computer Science and Information Engineering

Files in this item:
File: ntu-113-2.pdf
Access: restricted to NTU campus IP addresses (off-campus users, please connect via the library's VPN service)
Size: 25.07 MB
Format: Adobe PDF


Items in the repository are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.
