NTU Theses and Dissertations Repository (DSpace)
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99418

Full metadata record (DC field: value (language))
dc.contributor.advisor: 傅立成 (zh_TW)
dc.contributor.advisor: Li-Chen Fu (en)
dc.contributor.author: 陳裕翔 (zh_TW)
dc.contributor.author: Yu-Hsiang Chen (en)
dc.date.accessioned: 2025-09-10T16:13:42Z
dc.date.available: 2025-09-11
dc.date.copyright: 2025-09-10
dc.date.issued: 2025
dc.date.submitted: 2025-08-04
dc.identifier.citation:
[1] Agora. Agora real-time voice and video engagement, 2025.
[2] J. Brooke. SUS: A quick and dirty usability scale. Usability Evaluation in Industry, 189(194):4–7, 1996.
[3] J. Chen, Z. Lv, S. Wu, K. Qinghong Lin, C. Song, D. Gao, J.-W. Liu, Z. Gao, D. Mao, and M. Z. Shou. VideoLLM-online: Online video large language model for streaming video. CVPR 2024.
[4] J. Chen, S. Xiao, P. Zhang, K. Luo, D. Lian, and Z. Liu. BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024.
[5] W. Cheng, E. Kim, and J. H. Ko. HandDAGT: A denoising adaptive graph transformer for 3D hand pose estimation, 2024.
[6] M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P.-E. Mazaré, M. Lomeli, L. Hosseini, and H. Jégou. The Faiss library, 2024.
[7] H. Durrant-Whyte and T. Bailey. Simultaneous localization and mapping: Part I. IEEE Robotics & Automation Magazine, 13(2):99–110, 2006.
[8] S. Fernández, M. Montagud, G. Cernigliaro, and D. Rincón. Multi-party holomeetings: Toward a new era of low-cost volumetric holographic meetings in virtual reality, 2022.
[9] Epic Games. Unreal Engine: The most powerful real-time 3D creation tool, 2025.
[10] Google. Google Meet: Online web and video conferencing calls, 2025.
[11] S. W. Greenwald, W. Corning, G. McDowell, P. Maes, and J. Belcher. ElectroVR: An electrostatic playground for collaborative, simulation-based exploratory learning in immersive virtual reality, 2019.
[12] S. G. Hart and L. E. Staveland. Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. Volume 52, pages 139–183. Elsevier, 1988.
[13] Jorjin Technologies Inc. Jorjin Technologies, 2019.
[14] R. Johansen. GroupWare: Computer Support for Business Teams. The Free Press, 1988.
[15] G. Lee, Y. Yang, J. Healey, and D. Manocha. Since U Been Gone: Augmenting context-aware transcriptions for re-engaging in immersive VR meetings, 2025.
[16] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks, 2020.
[17] F. Li, R. Zhang, H. Zhang, Y. Zhang, B. Li, W. Li, Z. Ma, and C. Li. LLaVA-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models, 2024.
[18] Meta. Orion AI glasses: The future of AR glasses technology, 2024.
[19] Microsoft. Microsoft Teams: Video conferencing, meetings, calling, 2025.
[20] R. Mur-Artal and J. D. Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo and RGB-D cameras, 2016. IEEE Transactions on Robotics; doi:10.1109/TRO.2017.2705103.
[21] OpenAI, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. Leoni Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, R. Avila, I. Babuschkin, S. Balaji, V. Balcom, P. Baltescu, H. Bao, M. Bavarian, J. Belgum, I. Bello, J. Berdine, G. Bernadett-Shapiro, C. Berner, L. Bogdonoff, O. Boiko, M. Boyd, A.-L. Brakman, G. Brockman, T. Brooks, M. Brundage, K. Button, T. Cai, R. Campbell, A. Cann, B. Carey, C. Carlson, R. Carmichael, B. Chan, C. Chang, F. Chantzis, D. Chen, S. Chen, R. Chen, J. Chen, M. Chen, B. Chess, C. Cho, C. Chu, H. W. Chung, D. Cummings, J. Currier, Y. Dai, C. Decareaux, T. Degry, N. Deutsch, D. Deville, A. Dhar, D. Dohan, S. Dowling, S. Dunning, A. Ecoffet, A. Eleti, T. Eloundou, D. Farhi, L. Fedus, N. Felix, S. Posada Fishman, J. Forte, I. Fulford, L. Gao, E. Georges, C. Gibson, V. Goel, T. Gogineni, G. Goh, R. Gontijo-Lopes, J. Gordon, M. Grafstein, S. Gray, R. Greene, J. Gross, S. S. Gu, Y. Guo, C. Hallacy, J. Han, J. Harris, Y. He, M. Heaton, J. Heidecke, C. Hesse, A. Hickey, W. Hickey, P. Hoeschele, B. Houghton, K. Hsu, S. Hu, X. Hu, J. Huizinga, S. Jain, S. Jain, et al. GPT-4 technical report, 2023.
[22] G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. A. Osman, D. Tzionas, and M. J. Black. Expressive body capture: 3D hands, face, and body from a single image, 2019.
[23] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision (CLIP), 2021.
[24] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever. Robust speech recognition via large-scale weak supervision, 2022.
[25] J. Romero, D. Tzionas, and M. J. Black. Embodied hands: Modeling and capturing hands and bodies together. SIGGRAPH Asia 2017; ACM Transactions on Graphics, 36(6), Article 245, 2022.
[26] Unity Technologies. Unity real-time development platform, 2025.
[27] M. F. Ursu, M. Groen, M. Falelakis, M. Frantzis, V. Zsombori, and R. Kaiser. Orchestration: TV-like mixing grammars applied to video-communication for social groups, 2013.
[28] W. Weiss, R. Kaiser, and M. Falelakis. Orchestration for group videoconferencing: An interactive demonstrator, 2014.
[29] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. Li, T. Tang, W. Yin, X. Ren, X. Wang, X. Zhang, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Zhang, Y. Wan, Y. Liu, Z. Wang, Z. Cui, Z. Zhang, Z. Zhou, and Z. Qiu. Qwen3 technical report, 2025.
[30] S. Yang, J. Yim, J. Kim, and H. V. Shin. CatchLive: Real-time summarization of live streams with stream content and interaction data, 2022.
[31] F. Zhang, V. Bazarevsky, A. Vakunov, A. Tkachenka, G. Sung, C.-L. Chang, and M. Grundmann. MediaPipe Hands: On-device real-time hand tracking. CVPR Workshop on Computer Vision for Augmented and Virtual Reality, Seattle, WA, USA, 2020.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/99418
dc.description.abstract: Since the pandemic, remote meetings have become the norm for some kinds of work, and mainstream meeting systems such as Google Meet and Microsoft Teams are widely used. With advances in AR/VR technology, virtual meetings can be further enhanced to offer users a highly immersive and interactive experience.
In traditional meetings, someone usually organizes and takes minutes so the content can be clarified afterward; today's video conferencing likewise offers tools such as transcripts that record what speakers say, but current VR/AR meetings lack tools that record the meeting and assist users. We therefore propose ARM-RSPA, an assistant system that records meeting content and the state of the virtual environment in real time, summarizes on user request, and replays key interactions. Through AR replay, ARM-RSPA gives users a more convenient and comprehensible way to review meetings.
In our experiments, ARM-RSPA demonstrates its use in meetings and successfully helps users understand the content the speaker explains. We also discuss how ARM-RSPA differs from other remote-meeting systems and possible future directions for AR meeting tools.
(zh_TW)
dc.description.abstract: Since the pandemic, remote meetings have become the norm in some work environments, with mainstream meeting systems such as Google Meet and Microsoft Teams widely used. With advances in AR/VR technology, virtual meetings can be further enhanced to provide users with highly immersive and interactive experiences.
In traditional meetings, someone typically organizes and records the meeting so its content can be clarified afterward. While video conferencing now offers tools such as transcripts that record speakers' words, VR/AR meetings lack comparable tools for recording and assisting users. To address this gap, we develop ARM-RSPA, a real-time assistant system that records meeting content and virtual-environment states, generates summaries on user request, and replays key interactions. Through AR replay, ARM-RSPA provides users with a more convenient and easily understandable way to review meeting content.
To validate our work, we conducted several experiments in which ARM-RSPA demonstrates its application in meetings and successfully helps users understand the content explained by speakers. We also discuss how ARM-RSPA differs from other remote-meeting solutions to highlight the promise of the proposed system, and outline potential future developments of AR meeting tools.
(en)
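The abstract describes a record-retrieve-summarize-replay loop: meeting events and virtual-environment states are logged in real time, then a window of events is retrieved, summarized, or replayed on user request. As a purely illustrative sketch of that loop (the names `ActionFrame`, `PlaybackLog`, `retrieve`, and `summarize` are hypothetical, not taken from the thesis, and the real system uses an LLM-based multi-agent pipeline rather than string concatenation):

```python
from dataclasses import dataclass


@dataclass
class ActionFrame:
    # One recorded meeting event: timestamp in seconds, the speaker,
    # the transcribed utterance, and an optional AR-scene action payload.
    t: float
    speaker: str
    text: str
    action: str = ""


class PlaybackLog:
    """Minimal in-memory log: record frames, retrieve a time window, summarize it."""

    def __init__(self) -> None:
        self.frames: list[ActionFrame] = []

    def record(self, frame: ActionFrame) -> None:
        self.frames.append(frame)

    def retrieve(self, start: float, end: float) -> list[ActionFrame]:
        # Frames whose timestamps fall inside the requested replay window.
        return [f for f in self.frames if start <= f.t <= end]

    def summarize(self, start: float, end: float) -> str:
        # Stand-in for the LLM summarizer: join the transcripts in the window.
        return " ".join(f"{f.speaker}: {f.text}" for f in self.retrieve(start, end))
```

In the actual system, per the table of contents, retrieval would read the recorded action-frame file and summarization would go through the multi-agent assistant framework.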
dc.description.provenance: Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2025-09-10T16:13:42Z. No. of bitstreams: 0 (en)
dc.description.provenance: Made available in DSpace on 2025-09-10T16:13:42Z (GMT). No. of bitstreams: 0 (en)
dc.description.tableofcontents:
Acknowledgements i
摘要 ii
Abstract iii
Contents v
List of Figures viii
List of Tables x
Chapter 1 Introduction 1
1.1 Background 1
1.2 Motivation 2
1.3 Objective 3
1.4 Related Work 4
1.5 Thesis Organization 5
Chapter 2 Preliminary Work 7
2.1 Remote Collaboration 7
2.2 Simultaneous Localization and Mapping 8
2.3 Hand Pose Estimation 9
2.4 Human Avatar 10
2.5 Automatic Speech Recognition 12
2.5.1 Whisper 12
2.6 Large Language Model 12
2.6.1 Qwen3 13
2.7 Multimodal Large Language Model 14
2.7.1 CLIP 14
2.7.2 LLaVA-NeXT 15
2.8 Video Summarization 16
2.9 Retrieval-Augmented Generation 17
Chapter 3 Methodology 19
3.1 Application Development 21
3.1.1 User Interface Control 22
3.1.2 Object Manipulation 23
3.1.3 Social Interactions 24
3.1.4 Meeting Tools 28
3.1.5 User Event 30
3.1.5.1 Action Frame File 30
3.1.6 Meeting Assistant 31
3.2 Assistant Framework 35
3.2.1 Multi-agent Architecture 36
3.2.2 Preprocessing 36
3.2.3 Summary Generation 37
3.2.4 Playback Log Retriever 40
3.3 Server Communication and Synchronization 41
3.3.1 Communication Protocol 42
3.3.2 Orchestration 43
3.3.3 Media Servers 44
3.4 Use Case Scenario 45
3.4.1 Scenario Settings: Remote Car Sales Presentation 45
3.4.2 Result 46
4.1 User Study 48
4.1.1 Design 48
4.1.2 Experimental Setup 50
4.1.3 Participant 52
4.1.4 Procedure 54
4.1.5 Result and Discussion 55
4.2 Runtime Performance Evaluation 58
4.2.1 Experimental Setup 59
4.2.2 Result and Discussion 60
Chapter 5 Conclusion 62
References 64
Appendix A — User Study 69
A.1 Questionnaire 69
dc.language.iso: en
dc.subject: 虛擬實境/擴增實境 (Virtual Reality/Augmented Reality) (zh_TW)
dc.subject: 大型語言模型助理 (LLM Assistant) (zh_TW)
dc.subject: 即時摘要 (Real-Time Summarization) (zh_TW)
dc.subject: LLM Assistant (en)
dc.subject: VR/AR (en)
dc.subject: Live Stream Summarization (en)
dc.title: AR遠端會議中的即時摘要與重播助理 (Real-Time Summarization and Playback Assistant in AR Remote Meetings) (zh_TW)
dc.title: ARM-RSPA: Augmented Reality Meeting with Real-Time Summarization and Playback Assistant (en)
dc.type: Thesis
dc.date.schoolyear: 113-2
dc.description.degree: 碩士 (Master)
dc.contributor.oralexamcommittee: 歐陽明;陳祝嵩;鄭龍磻;莊永裕;徐偉恩 (zh_TW)
dc.contributor.oralexamcommittee: Ming Ouhyoung; Chu-Song Chen; Lung-Pan Cheng; Yung-Yu Chuang; Wei En Hsu (en)
dc.subject.keyword: 虛擬實境/擴增實境, 即時摘要, 大型語言模型助理 (zh_TW)
dc.subject.keyword: VR/AR, Live Stream Summarization, LLM Assistant (en)
dc.relation.page: 69
dc.identifier.doi: 10.6342/NTU202502618
dc.rights.note: 同意授權(全球公開) (Authorized; open access worldwide)
dc.date.accepted: 2025-08-07
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 資訊工程學系 (Department of Computer Science and Information Engineering)
dc.date.embargo-lift: 2028-09-01
Appears in Collections: 資訊工程學系 (Department of Computer Science and Information Engineering)

Files in This Item:
File: ntu-113-2.pdf (64.01 MB, Adobe PDF)
Available to the public online after 2028-09-01


Items in this system are protected by copyright, with all rights reserved, unless otherwise indicated.
