Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90701
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 傅立成 | zh_TW |
dc.contributor.advisor | Li-Chen Fu | en |
dc.contributor.author | 賴奕善 | zh_TW |
dc.contributor.author | Yi-Shan Lai | en |
dc.date.accessioned | 2023-10-03T17:14:46Z | - |
dc.date.available | 2023-11-10 | - |
dc.date.copyright | 2023-10-03 | - |
dc.date.issued | 2023 | - |
dc.date.submitted | 2023-08-08 | - |
dc.identifier.citation | [1] Xiao Li, Dong Zhang, Ming Li, and Dah-Jye Lee. Accurate head pose estimation using image rectification and a lightweight convolutional neural network. IEEE Transactions on Multimedia, 2022.
[2] Xucong Zhang, Yusuke Sugano, Mario Fritz, and Andreas Bulling. MPIIGaze: Real-world dataset and deep appearance-based gaze estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(1):162–175, 2017.
[3] Ahmed A. Abdelrahman, Thorsten Hempel, Aly Khalifa, and Ayoub Al-Hamadi. L2CS-Net: Fine-grained gaze estimation in unconstrained environments. arXiv preprint arXiv:2203.03339, 2022.
[4] Ismail Kayadibi, Gür Emre Güraksın, Uçman Ergün, and Nurgül Özmen Süzme. An eye state recognition system using transfer learning: AlexNet-based deep convolutional neural network. International Journal of Computational Intelligence Systems, 15(1):49, 2022.
[5] Ralph Oyini Mbouna, Seong G. Kong, and Myung-Geun Chun. Visual analysis of eye state and head pose for driver alertness monitoring. IEEE Transactions on Intelligent Transportation Systems, 14(3):1462–1469, 2013.
[6] Ceenu George, Daniel Buschek, Andrea Ngao, and Mohamed Khamis. GazeRoomLock: Using gaze and head-pose to improve the usability and observation resistance of 3D passwords in virtual reality. In Augmented Reality, Virtual Reality, and Computer Graphics: 7th International Conference, AVR 2020, Lecce, Italy, September 7–10, 2020, Proceedings, Part I, pages 61–81. Springer, 2020.
[7] Tingting Liu, Bing Yang, Hai Liu, Jianping Ju, Jianyin Tang, Sriram Subramanian, and Zhaoli Zhang. GMDL: Toward precise head pose estimation via Gaussian mixed distribution learning for students' attention understanding. Infrared Physics & Technology, 122:104099, 2022.
[8] Jamie Sherrah, Shaogang Gong, and Eng-Jon Ong. Understanding pose discrimination in similarity space. In BMVC, pages 1–10, 1999.
[9] Jamie Sherrah, Shaogang Gong, and Eng-Jon Ong. Face distributions in similarity space under varying head pose. Image and Vision Computing, 19(12):807–819, 2001.
[10] Nicolas Gourier, Daniela Hall, and James L. Crowley. Estimating face orientation from robust detection of salient facial structures. In FG Net Workshop on Visual Observation of Deictic Gestures, volume 6, page 7. Citeseer, 2004.
[11] Martin Koestinger, Paul Wohlhart, Peter M. Roth, and Horst Bischof. Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pages 2144–2151. IEEE, 2011.
[12] Xiangyu Zhu, Zhen Lei, Xiaoming Liu, Hailin Shi, and Stan Z. Li. Face alignment across large poses: A 3D solution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 146–155, 2016.
[13] Gabriele Fanelli, Matthias Dantone, Juergen Gall, Andrea Fossati, and Luc Van Gool. Random forests for real time 3D face analysis. International Journal of Computer Vision, 101:437–458, 2013.
[14] Yihua Cheng, Shiyao Huang, Fei Wang, Chen Qian, and Feng Lu. A coarse-to-fine adaptive network for appearance-based gaze estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 10623–10630, 2020.
[15] Petr Kellnhofer, Adria Recasens, Simon Stent, Wojciech Matusik, and Antonio Torralba. Gaze360: Physically unconstrained gaze estimation in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6912–6921, 2019.
[16] Zhaojie Liu and Haizhou Ai. Automatic eye state recognition and closed-eye photo correction. In 2008 19th International Conference on Pattern Recognition, pages 1–4. IEEE, 2008.
[17] Aleksandra Królak and Paweł Strumiłło. Eye-blink detection system for human–computer interaction. Universal Access in the Information Society, 11:409–419, 2012.
[18] Ang Liu, Zhichao Li, Lang Wang, and Yong Zhao. A practical driver fatigue detection algorithm based on eye state. In 2010 Asia Pacific Conference on Postgraduate Research in Microelectronics and Electronics (PrimeAsia), pages 235–238. IEEE, 2010.
[19] Mei Wang, Lin Guo, and Wen-Yuan Chen. Blink detection using AdaBoost and contour circle for fatigue recognition. Computers & Electrical Engineering, 58:502–512, 2017.
[20] Yu-Shan Wu, Ting-Wei Lee, Quen-Zong Wu, and Heng-Sung Liu. An eye state recognition method for drowsiness detection. In 2010 IEEE 71st Vehicular Technology Conference, pages 1–5. IEEE, 2010.
[21] Leo Breiman. Random forests. Machine Learning, 45:5–32, 2001.
[22] SAE International. Taxonomy and definitions for terms related to driving automation systems for on-road motor vehicles. SAE International, 4970(724):1–5, 2018.
[23] Dawei Yang, Xinlei Li, Xiaotian Dai, Rui Zhang, Lizhe Qi, Wenqiang Zhang, and Zhe Jiang. All in one network for driver attention monitoring. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2258–2262. IEEE, 2020.
[24] Hsueh-Wei Chen, Yi Chen, Pei-Yung Hsiao, Li-Chen Fu, and Zi-Rong Ding. GLPose: Global-local attention network with feature interpolation regularization for head pose estimation of people wearing facial masks. In 33rd British Machine Vision Conference 2022, BMVC 2022, London, UK, November 21–24, 2022. BMVA Press, 2022.
[25] Amit Kumar, Azadeh Alavi, and Rama Chellappa. KEPLER: Keypoint and pose estimation of unconstrained faces by learning efficient H-CNN regressors. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pages 258–265. IEEE, 2017.
[26] Jianzhu Guo, Xiangyu Zhu, Yang Yang, Fan Yang, Zhen Lei, and Stan Z. Li. Towards fast, accurate and stable 3D dense face alignment. In Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIX, pages 152–168. Springer, 2020.
[27] Miao Xin, Shentong Mo, and Yuanze Lin. EVA-GCN: Head pose estimation based on graph convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1462–1471, 2021.
[28] Nataniel Ruiz, Eunji Chong, and James M. Rehg. Fine-grained head pose estimation without keypoints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2074–2083, 2018.
[29] Tsun-Yi Yang, Yi-Ting Chen, Yen-Yu Lin, and Yung-Yu Chuang. FSA-Net: Learning fine-grained structure aggregation for head pose estimation from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1087–1096, 2019.
[30] Oliver Williams, Andrew Blake, and Roberto Cipolla. Sparse and semi-supervised visual mapping with the S^3GP. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 1, pages 230–237. IEEE, 2006.
[31] Feng Lu, Yusuke Sugano, Takahiro Okabe, and Yoichi Sato. Adaptive linear regression for appearance-based gaze estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(10):2033–2046, 2014.
[32] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[33] Zhaokang Chen and Bertram E. Shi. Towards high performance low complexity calibration in appearance based gaze estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):1174–1188, 2022.
[34] Pradipta Biswas et al. Appearance-based gaze estimation using attention and difference mechanism. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3143–3152, 2021.
[35] Bappaditya Mandal, Liyuan Li, Gang Sam Wang, and Jie Lin. Towards detection of bus driver fatigue based on robust visual analysis of eye state. IEEE Transactions on Intelligent Transportation Systems, 18(3):545–557, 2016.
[36] Mohammad Dehnavi and Mohammad Eshghi. Design and implementation of a real time and train-less eye state recognition system. EURASIP Journal on Advances in Signal Processing, 2012:1–12, 2012.
[37] Xianming Lin, Ling Cai, and Rongrong Ji. An effective eye states detection method based on the projection of the gray interval distribution. In 2015 IEEE International Conference on Image Processing (ICIP), pages 1875–1879. IEEE, 2015.
[38] Ying-li Tian, Takeo Kanade, and Jeffrey F. Cohn. Eye-state action unit detection by Gabor wavelets. In Advances in Multimodal Interfaces - ICMI 2000: Third International Conference, Beijing, China, October 14–16, 2000, Proceedings, pages 143–150. Springer, 2000.
[39] Hai Yan Yang, Xin Hua Jiang, Lei Wang, and Yong Hui Zhang. Eye statement recognition for driver fatigue detection based on Gabor wavelet and HMM. In Applied Mechanics and Materials, volume 128, pages 123–129. Trans Tech Publications, 2012.
[40] Marco Javier Flores, José María Armingol, and Arturo de la Escalera. Driver drowsiness warning system using visual information for both diurnal and nocturnal illumination conditions. EURASIP Journal on Advances in Signal Processing, 2010(1):1–23, 2010.
[41] Fengyi Song, Xiaoyang Tan, Xue Liu, and Songcan Chen. Eyes closeness detection from still images with multi-scale histograms of principal oriented gradients. Pattern Recognition, 47(9):2825–2838, 2014.
[42] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
[43] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012.
[44] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[45] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[46] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
[47] Shikun Liu, Edward Johns, and Andrew J. Davison. End-to-end multi-task learning with attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1871–1880, 2019.
[48] Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3994–4003, 2016.
[49] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7482–7491, 2018.
[50] Thorsten Hempel, Ahmed A. Abdelrahman, and Ayoub Al-Hamadi. 6D rotation representation for unconstrained head pose estimation. In 2022 IEEE International Conference on Image Processing (ICIP), pages 2496–2500. IEEE, 2022.
[51] Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5745–5753, 2019.
[52] R. Fusek. Pupil localization using geodesic distance. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 11241 LNCS:433–444, 2018.
[53] Preeti Nagrath, Rachna Jain, Agam Madan, Rohan Arora, Piyush Kataria, and Jude Hemanth. SSDMNV2: A real time DNN-based face mask detection system using single shot multibox detector and MobileNetV2. Sustainable Cities and Society, 66:102692, 2021.
[54] Preeti Nagrath, Rachna Jain, Agam Madan, Rohan Arora, Piyush Kataria, and Jude Hemanth. SSDMNV2: A real time DNN-based face mask detection system using single shot multibox detector and MobileNetV2. Sustainable Cities and Society, 66:102692, 2021.
[55] Xuefeng Liu, Zhenqing Jia, Xiaoke Hou, Min Fu, Li Ma, and Qiaoqiao Sun. Real time marine animal images classification by embedded system based on MobileNet and transfer learning. In OCEANS 2019 - Marseille, pages 1–5. IEEE, 2019.
[56] Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H. Chi. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1930–1939, 2018.
[57] Diganta Misra, Trikay Nalamada, Ajay Uppili Arasanipalai, and Qibin Hou. Rotate to attend: Convolutional triplet attention module. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3139–3148, 2021.
[58] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
[59] Gary B. Huang, Marwan Mattar, Tamara Berg, and Eric Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition, 2008.
[60] Paul Viola and Michael J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57:137–154, 2004.
[61] Xiaoyang Tan, Fengyi Song, Zhi-Hua Zhou, and Songcan Chen. Enhanced pictorial structures for precise eye localization under uncontrolled conditions. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1621–1628. IEEE, 2009.
[62] Tobias Fischer, Hyung Jin Chang, and Yiannis Demiris. RT-GENE: Real-time eye gaze estimation in natural environments. In Proceedings of the European Conference on Computer Vision (ECCV), pages 334–352, 2018. | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90701 | - |
dc.description.abstract | 近年來,隨著自動駕駛技術的發展,儘管自動駕駛系統的性能日益提高,然而在面對緊急情況時,仍然需要駕駛進行控制權接管的限制成為關注的重點。為了解決這個問題,設計一個可以運行在車輛上的駕駛員監測系統變得越來越重要。透過監測駕駛員的狀態,如頭部姿態、視線方向與開閉眼狀態等資訊,可以幫助自動駕駛系統判斷駕駛人者是否有專注於道路狀況並有能力接管駕駛任務。
本論文提出了第一個以全臉影像作為輸入同時進行頭部姿態估計、眼睛狀態偵測和視線狀態估計的深度學習模型,能夠應用在駕駛員狀態監測系統來幫助自動駕駛系統判斷駕駛者是否能夠接管控制權。由於車輛上的自動駕駛系統有硬體上的限制,我們使用多任務學習的方式,以單一模型進行推論,進而避免使用多個模型進行預測時需要大量記憶體的情況。為了能讓模型在每個任務分支中能夠分辨出自身需要的特徵,我們設計跨維度注意力模組來為每個分支增強並篩選出對於各自任務所需要的特徵。此外,為了解決開閉眼狀態資料集沒有包含多角度頭部姿態的問題,我們提出生成具有眼睛狀態標註資料的資料擴增技術,提升模型在不同角度頭部姿態開閉眼偵測的穩定性,並且在公開CEW資料集驗證其可行性,並設立一個以全臉影像為輸入的眼睛狀態偵測基準線。 | zh_TW |
dc.description.abstract | In recent years, the performance of autonomous driving systems has steadily improved; nevertheless, the need for drivers to take over control in emergency situations remains a major concern. It has therefore become increasingly important to design a driver monitoring system that can run on the vehicle. By monitoring the driver's state, such as head pose, gaze direction, and eye open/closed state, the autonomous driving system can determine whether the driver is focused on road conditions and is able to take over the driving task.
In this thesis, we propose the first deep learning model that takes full-face images as input and simultaneously performs head pose estimation, eye state detection, and gaze estimation. The model can serve driver-state monitoring systems, helping an autonomous driving system decide whether the driver is capable of taking over control. Because in-vehicle autonomous driving systems face hardware constraints, we adopt a multi-task learning approach that performs inference with a single model, avoiding the large memory footprint of running multiple models. To let each task branch isolate the features it needs, we propose a task-based cross-dimensional attention module that filters and enhances the shared features for each respective task. Additionally, to address the lack of varied head pose angles in eye state datasets, we propose a data augmentation technique that generates augmented data with eye state labels, improving the model's robustness in detecting eye states under different head poses. Finally, we validate the feasibility of the proposed approach through extensive experiments; its performance on AFLW2000, BIWI, and Gaze360 is competitive with state-of-the-art work. Furthermore, we employ the CEW dataset and introduce a baseline for eye state detection with full-face images as input. | en
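The single-model, multi-branch design described in the abstract, a shared backbone whose features each task branch re-weights with its own attention gate before making its prediction, can be sketched as a toy NumPy forward pass. This is a hypothetical illustration under simplifying assumptions, not the thesis's actual TCAM: `channel_attention`, the random weights, and the per-task output sizes (3 head-pose angles, 2 gaze angles, 1 eye-state logit) are all stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def channel_attention(feat, w):
    """Squeeze-and-excite-style channel gate: global-average-pool the
    spatial dims, score each channel, and rescale the feature map.
    A simplified stand-in for a task-specific attention module."""
    pooled = feat.mean(axis=(1, 2))               # (C,)
    gate = 1.0 / (1.0 + np.exp(-(w @ pooled)))    # sigmoid scores, (C,)
    return feat * gate[:, None, None]             # re-weighted (C, H, W)

def forward(shared):
    """One multi-task forward pass over a (C, H, W) feature map standing
    in for the shared-backbone output. Each branch first filters the
    shared features with its own gate, then applies its own linear head."""
    C = shared.shape[0]
    outputs = {}
    for task, out_dim in [("head_pose", 3), ("gaze", 2), ("eye_state", 1)]:
        w_att = rng.standard_normal((C, C)) * 0.1        # hypothetical gate weights
        w_head = rng.standard_normal((out_dim, C)) * 0.1  # hypothetical task head
        attended = channel_attention(shared, w_att)
        outputs[task] = w_head @ attended.mean(axis=(1, 2))
    return outputs

out = forward(rng.standard_normal((16, 8, 8)))
print({k: v.shape for k, v in out.items()})
# {'head_pose': (3,), 'gaze': (2,), 'eye_state': (1,)}
```

The point of the per-branch gate is that all three tasks read the same backbone features, but each branch emphasizes a different subset of channels, which is how a single model can serve three predictions without tripling the memory footprint.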
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-10-03T17:14:46Z No. of bitstreams: 0 | en |
dc.description.provenance | Made available in DSpace on 2023-10-03T17:14:46Z (GMT). No. of bitstreams: 0 | en |
dc.description.tableofcontents | 致謝 (Acknowledgements)
中文摘要 (Chinese Abstract)
ABSTRACT
CONTENTS
LIST OF FIGURES
LIST OF TABLES
Chapter 1 Introduction
1.1 Background
1.1.1 Head Pose Estimation
1.1.2 Gaze Estimation
1.1.3 Eye State Estimation
1.2 Motivation
1.3 Objectives
1.4 Related Work
1.4.1 Head Pose Estimation
1.4.2 Gaze Estimation
1.4.3 Eye State Detection
1.5 Contributions
1.6 Thesis Organization
Chapter 2 Preliminaries
2.1 Deep Learning Network
2.1.1 Convolutional Layer
2.1.1.1 Original Convolution
2.1.1.2 Depthwise Separable Convolution
2.1.2 Pooling Layer
2.1.3 Activation Function
2.1.4 Fully Connected Layer
2.1.5 Loss Function
2.2 Backbone Network Architecture
2.2.1 VGGNet
2.2.2 ResNet
2.2.3 MobileNet
2.3 Multi-task Learning
Chapter 3 Methodology
3.1 Architecture Overview
3.2 Data Preprocessing
3.2.1 Rotation Matrix Representation
3.2.2 Eye State Data Synthesis
3.3 Shared Feature Backbone
3.4 Task-based Cross-dimensional Attention Module (TCAM)
3.5 Loss Function
Chapter 4 Experiments
4.1 Dataset
4.1.1 300W across Large Poses Dataset
4.1.2 Annotated Facial Landmarks in the Wild 2000
4.1.3 BIWI
4.1.4 Gaze360
4.1.5 MRL Dataset
4.1.6 Closed Eyes In The Wild
4.2 Evaluation Metrics
4.3 Implementation Details
4.4 Ablation Study
4.5 Quantitative Results
4.6 Visualization Result
Chapter 5 Conclusion
REFERENCES | - |
dc.language.iso | en | - |
dc.title | 使用基於任務的跨維度注意力模組之高效多任務學習深度卷積網路應用於面部資訊偵測 | zh_TW |
dc.title | An Efficient Deep Convolutional Network for Face Information Detection with Multi-Task Learning Enhanced by Task-based Cross-Dimensional Attention Module | en |
dc.type | Thesis | - |
dc.date.schoolyear | 111-2 | - |
dc.description.degree | Master's (碩士) | - |
dc.contributor.oralexamcommittee | 蕭培墉;黃世勳;方瓊瑤;鄭文皇 | zh_TW |
dc.contributor.oralexamcommittee | Pei-Yung Hsiao;Shih-Shinh Huang;Chiung-Yao Fang;Wen-Huang Cheng | en |
dc.subject.keyword | 深度學習,多任務學習,資料擴增,輕量化模型 | zh_TW |
dc.subject.keyword | deep learning, multi-task learning, data augmentation, light-weight model | en |
dc.relation.page | 70 | - |
dc.identifier.doi | 10.6342/NTU202303294 | - |
dc.rights.note | Authorized for release (campus access only) | - |
dc.date.accepted | 2023-08-10 | - |
dc.contributor.author-college | College of Electrical Engineering and Computer Science | - |
dc.contributor.author-dept | Department of Computer Science and Information Engineering | - |
Appears in Collections: | Department of Computer Science and Information Engineering
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-111-2.pdf (currently not authorized for public access) | 9.68 MB | Adobe PDF | View/Open |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.