Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92655
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 丁肇隆 | zh_TW |
dc.contributor.advisor | Chao-Lung Ting | en |
dc.contributor.author | 陳翊瑄 | zh_TW |
dc.contributor.author | I-HSUAN CHEN | en |
dc.date.accessioned | 2024-05-30T16:06:21Z | - |
dc.date.available | 2024-05-31 | - |
dc.date.copyright | 2024-05-30 | - |
dc.date.issued | 2024 | - |
dc.date.submitted | 2024-05-27 | - |
dc.identifier.citation | https://www.femh.org.tw/magazine/viewmag?ID=6588
Recasens, A., Khosla, A., Vondrick, C., & Torralba, A. (2015). Where are they looking? Advances in Neural Information Processing Systems, 28.
Krafka, K., Khosla, A., Kellnhofer, P., Kannan, H., Bhandarkar, S., Matusik, W., & Torralba, A. (2016). Eye tracking for everyone. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2176-2184).
Young, L. R., & Sheena, D. (1975). Survey of eye movement recording methods. Behavior Research Methods & Instrumentation, 7(5), 397-429.
George, A. (2019). Image based eye gaze tracking and its applications. arXiv preprint arXiv:1907.04325.
Morimoto, C. H., Koons, D., Amir, A., & Flickner, M. (2000). Pupil detection and tracking using multiple light sources. Image and Vision Computing, 18(4), 331-335.
Guestrin, E. D., & Eizenman, M. (2006). General theory of remote gaze estimation using the pupil center and corneal reflections. IEEE Transactions on Biomedical Engineering, 53(6), 1124-1133.
Wang, K., & Ji, Q. (2017). Real time eye gaze tracking with 3d deformable eye-face model. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1003-1011).
Kar, A., & Corcoran, P. (2017). A review and analysis of eye-gaze estimation systems, algorithms and performance evaluation methods in consumer platforms. IEEE Access, 5, 16495-16519.
Ba, S. O., & Odobez, J. M. (2010). Multiperson visual focus of attention from head pose and meeting contextual cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1), 101-116.
Zhang, X., Sugano, Y., Fritz, M., & Bulling, A. (2015). Appearance-based gaze estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4511-4520).
Zhang, X., Sugano, Y., Fritz, M., & Bulling, A. (2017). MPIIGaze: Real-world dataset and deep appearance-based gaze estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(1), 162-175.
Cheng, Y., Lu, F., & Zhang, X. (2018). Appearance-based gaze estimation via evaluation-guided asymmetric regression. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 100-115).
Chen, Z., & Shi, B. E. (2018, December). Appearance-based gaze estimation using dilated-convolutions. In Asian Conference on Computer Vision (pp. 309-324). Cham: Springer International Publishing.
Zhang, X., Sugano, Y., Fritz, M., & Bulling, A. (2017). It's written all over your face: Full-face appearance-based gaze estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 51-60).
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25.
Zhang, X., Sugano, Y., & Bulling, A. (2019, May). Evaluation of appearance-based methods and implications for gaze-based applications. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (pp. 1-13).
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology, 160(1), 106.
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4), 193-202.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., ... & Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020, August). End-to-end object detection with transformers. In European Conference on Computer Vision (pp. 213-229). Cham: Springer International Publishing.
Deng, Y., Tang, F., Dong, W., Ma, C., Pan, X., Wang, L., & Xu, C. (2022). StyTr2: Image style transfer with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11326-11336).
Zheng, C., Zhu, S., Mendieta, M., Yang, T., Chen, C., & Ding, Z. (2021). 3D human pose estimation with spatial and temporal transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 11656-11665).
Cheng, Y., & Lu, F. (2022, August). Gaze estimation using transformer. In 2022 26th International Conference on Pattern Recognition (ICPR) (pp. 3341-3347). IEEE.
Zhang, X., Park, S., Beeler, T., Bradley, D., Tang, S., & Hilliges, O. (2020). ETH-XGaze: A large scale dataset for gaze estimation under extreme head pose and gaze variation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16 (pp. 365-381). Springer International Publishing.
Zhang, X., Sugano, Y., Bulling, A., & Hilliges, O. (2020, September). Learning-based region selection for end-to-end gaze estimation. In 31st British Machine Vision Conference (BMVC 2020) (p. 86). British Machine Vision Association.
Mehta, S., & Rastegari, M. (2021). MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv preprint arXiv:2110.02178.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L. C. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4510-4520).
Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., & Liu, W. (2019). CCNet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 603-612).
Han, D., Pan, X., Han, Y., Song, S., & Huang, G. (2023). FLatten Transformer: Vision transformer using focused linear attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 5961-5971).
Zhang, K., Zhang, Z., Li, Z., & Qiao, Y. (2016). Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10), 1499-1503.
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 779-788).
Qi, D., Tan, W., Yao, Q., & Liu, J. (2022, October). YOLO5Face: Why reinventing a face detector. In European Conference on Computer Vision (pp. 228-244). Cham: Springer Nature Switzerland.
Balim, H., Park, S., Wang, X., Zhang, X., & Hilliges, O. (2023). EFE: End-to-end frame-to-gaze estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2687-2696).
Wong, E. T., Yean, S., Hu, Q., Lee, B. S., Liu, J., & Deepu, R. (2019, March). Gaze estimation using residual neural network. In 2019 IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops) (pp. 411-414). IEEE. | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/92655 | - |
dc.description.abstract | 隨著人工智慧的進步,人機互動技術取得了顯著的進展,視線估計技術已不再受限於以昂貴精密儀器測量之方式,而是透過深度學習方法的應用。這項進步不僅為娛樂領域帶來了新的發展方向,亦對漸凍人等疾病需求,帶來了新的研究方向。然而,將模型部署於移動設備上時,模型之參數量成為了一個重要的考量因素。本研究提出一個基於輕量化Transformer之視線估計模型,於MPIIFaceGaze子集上,以較少之參數量與較低之浮點數運算量,在測試集性能上獲取比先前研究更低之3.98°平均角度誤差。此外,本研究也設計一個簡單的系統,在實驗設備上測試模型性能。於實驗中,本研究以視線區塊為一個實驗單位,並將預估之視線向量,經由影像後處理轉換為螢幕之視線落點。在解析度為1280×720螢幕上,8格視線區塊實驗所預估之視線落點,可以達到100%之準確率,而12格視線區塊實驗視線落點之準確度則約為80%。 | zh_TW |
dc.description.abstract | With the advancement of artificial intelligence, human-computer interaction technology has made significant progress, and gaze estimation is no longer limited to measurement with costly, high-precision instruments but can now be performed with deep learning methods. This advancement not only opens new directions in the entertainment domain but also new research avenues for conditions such as ALS. However, when deploying models on mobile devices, the model's parameter count becomes a crucial consideration. This study proposes a gaze estimation model based on a lightweight Transformer architecture that achieves a mean angular error of 3.98° on the MPIIFaceGaze subset, lower than previous work, with fewer parameters and fewer floating-point operations. Additionally, a simple system is designed to test the model's performance on the experimental devices. In the experiments, gaze blocks serve as the experimental unit, and the estimated gaze vectors are converted into on-screen gaze points through image post-processing. On a screen with a resolution of 1280×720 pixels, the estimated gaze points in the 8-block experiment achieve 100% accuracy, while the accuracy of the 12-block experiment is approximately 80%. | en |
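The abstract's two quantitative claims can be made concrete. Below is a minimal sketch, not the thesis's actual code, of how such results are typically computed: the angular error between predicted and ground-truth gaze directions, assuming the pitch/yaw gaze representation common to MPIIFaceGaze-based work, and the mapping from an estimated on-screen gaze point to a gaze-block index on a 1280×720 screen. The 4×2 layout assumed for the 8-block grid and all function names are illustrative assumptions.

```python
import numpy as np

def pitchyaw_to_vector(pitchyaw):
    """Convert a (pitch, yaw) pair in radians to a 3D unit gaze vector
    (the convention used by the MPIIGaze/MPIIFaceGaze tooling)."""
    pitch, yaw = pitchyaw
    return np.array([
        -np.cos(pitch) * np.sin(yaw),   # x: horizontal component
        -np.sin(pitch),                 # y: vertical component
        -np.cos(pitch) * np.cos(yaw),   # z: toward the camera
    ])

def angular_error_deg(pred_pitchyaw, true_pitchyaw):
    """Angular error in degrees between predicted and ground-truth gaze."""
    a = pitchyaw_to_vector(pred_pitchyaw)
    b = pitchyaw_to_vector(true_pitchyaw)
    cos_sim = np.clip(np.dot(a, b), -1.0, 1.0)  # both vectors are unit length
    return np.degrees(np.arccos(cos_sim))

def gaze_point_to_block(x, y, screen_w=1280, screen_h=720, cols=4, rows=2):
    """Map an on-screen gaze point (pixels) to a row-major block index.
    cols=4, rows=2 gives an 8-block grid; cols=4, rows=3 gives 12 blocks.
    (The grid layout here is an assumption, not the thesis's layout.)"""
    col = min(int(x * cols / screen_w), cols - 1)
    row = min(int(y * rows / screen_h), rows - 1)
    return row * cols + col

# Example: a prediction about 4 degrees off, and a gaze point in block 5
print(angular_error_deg((0.10, 0.20), (0.10, 0.27)))  # ~= 4.0 degrees
print(gaze_point_to_block(500.0, 600.0))              # block 5 (row 1, col 1)
```

Under these assumptions, the reported 3.98° would correspond to the mean of `angular_error_deg` over all test samples, and the block-accuracy figures would count a trial as correct when the predicted and target block indices match.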
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-05-30T16:06:21Z No. of bitstreams: 0 | en |
dc.description.provenance | Made available in DSpace on 2024-05-30T16:06:21Z (GMT). No. of bitstreams: 0 | en |
dc.description.tableofcontents | Chinese Abstract i
ABSTRACT ii
Table of Contents iii
List of Figures v
List of Tables viii
Chapter 1 Introduction 1
1.1 Research Background and Motivation 1
1.2 Thesis Organization 2
Chapter 2 Literature Review 4
2.1 Overview of Gaze Estimation 4
2.2 Gaze Estimation 6
2.2.1 Geometric Model-Based Methods 7
2.2.2 Appearance-Based Methods 9
2.3 Deep Learning 12
2.3.1 Transformer 16
2.3.2 Vision-Based Transformer 17
Chapter 3 Research Methods 19
3.1 Dataset 19
3.1.1 MPIIFaceGaze 19
3.1.2 Data Preprocessing 20
3.2 Model Architecture 24
3.2.1 Feature Extraction 24
3.2.2 Transformer Structure 30
3.2.3 Loss Function 31
3.2.4 Face Detection 31
3.2.5 End-to-End Gaze Estimation 32
3.3 Evaluation Metrics 34
Chapter 4 Experimental Results and Analysis 35
4.1 Experimental Environment and Equipment 35
4.2 Training Parameter Settings 35
4.3 Evaluation and Comparison 36
4.4 Testing 46
4.4.1 Image Post-Processing 46
4.4.2 Analysis of Test Results 48
Chapter 5 Conclusion 58
REFERENCE 59 | - |
dc.language.iso | zh_TW | - |
dc.title | 基於視覺的輕量級注意力網路於視線估計 | zh_TW |
dc.title | Vision-Based Lightweight Attention Networks for Gaze Estimation | en |
dc.type | Thesis | - |
dc.date.schoolyear | 112-2 | - |
dc.description.degree | Master | - |
dc.contributor.oralexamcommittee | 張瑞益;陳昭宏;謝傳璋 | zh_TW |
dc.contributor.oralexamcommittee | Ray-I Chang;Jau-Horng Chen;Chuan-Zhang Xie | en |
dc.subject.keyword | 深度學習,電腦視覺,視線偵測,影像處理 | zh_TW |
dc.subject.keyword | deep learning, computer vision, gaze estimation, image processing | en |
dc.relation.page | 63 | - |
dc.identifier.doi | 10.6342/NTU202401003 | - |
dc.rights.note | Not authorized | - |
dc.date.accepted | 2024-05-28 | - |
dc.contributor.author-college | College of Engineering | - |
dc.contributor.author-dept | Department of Engineering Science and Ocean Engineering | - |
Appears in Collections: | Department of Engineering Science and Ocean Engineering
Files in This Item:
File | Size | Format |
---|---|---|
ntu-112-2.pdf (currently not authorized for public access) | 4.63 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.