NTU Theses and Dissertations Repository
Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101408
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | 李明穗 | zh_TW
dc.contributor.advisor | Ming-Sui Lee | en
dc.contributor.author | 張璟榮 | zh_TW
dc.contributor.author | Ching-Jung Chang | en
dc.date.accessioned | 2026-01-27T16:37:12Z | -
dc.date.available | 2026-01-28 | -
dc.date.copyright | 2026-01-27 | -
dc.date.issued | 2026 | -
dc.date.submitted | 2026-01-14 | -
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/101408 | -
dc.description.abstract | 本論文針對寵物動作辨識在技術與資料面所面臨的挑戰進行研究,並特別聚焦於家犬家貓自動化健康監測技術中的關鍵缺口。本研究的主要貢獻是建置 PetAction:一個經整理與篩選的影片資料集,並獨特地納入臨床上具重要意義的異常行為,例如癲癇發作、嘔吐與運動障礙等,彌補過往動物行為基準中此類資料多數缺乏的問題。為了在顯著的外觀差異與複雜的身體形變情境下有效建模這些複雜行為,本研究提出一套完整的架構。其中,所提出的 JOFF(Joint-Optical Flow Fusion)辨識模組採用多串流設計,整合稀疏的骨架幾何資訊與稠密的光流動態資訊。透過納入 Joint Stream 以捕捉身體結構、Local Flow Stream 以編碼細微的組織運動,以及 I3DGCN Stream 以維持全域時間一致性,模型克服了單獨使用關鍵點時的固有限制。實驗結果顯示,本研究框架具備具競爭力的表現:在 PetAction 上達到 78.01% 準確率,並在野生動物資料集 KABR 上以 85.60% 展現良好的泛化能力。消融實驗也進一步證實,透過光流進行顯式的運動建模,對於區分視覺上容易混淆、且僅靠骨架結構不足以判別的行為至關重要。 | zh_TW
dc.description.abstract | This thesis addresses the technical and data-related challenges of pet action recognition, specifically targeting the critical gap in automated health monitoring technologies for domestic cats and dogs. The primary contribution of this work is the construction of PetAction, a curated video dataset that uniquely includes clinically relevant abnormal behaviors, such as seizures, vomiting, and movement disorders, which were largely absent from prior animal behavior benchmarks. To model these complex behaviors effectively under significant appearance variations and complex body deformations, a comprehensive framework was developed. The proposed JOFF (Joint-Optical Flow Fusion) recognition module introduces a multi-stream architecture that synergistically integrates sparse skeletal geometry with dense optical flow dynamics. By incorporating a Joint Stream to capture anatomical structure, a Local Flow Stream to encode subtle tissue movements, and an I3DGCN Stream to preserve global temporal consistency, the model overcomes the inherent limitations of using keypoints alone. Experimental results demonstrate that the proposed framework achieves competitive performance, recording 78.01% accuracy on PetAction and showing strong generalization with 85.60% accuracy on the wild-animal dataset KABR. Ablation studies further confirm that explicit motion modeling via optical flow is critical for distinguishing visually ambiguous behaviors where skeletal structure alone is insufficient. (A minimal illustrative sketch of the three-stream design follows the metadata record below.) | en
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2026-01-27T16:37:12Z. No. of bitstreams: 0 | en
dc.description.provenance | Made available in DSpace on 2026-01-27T16:37:12Z (GMT). No. of bitstreams: 0 | en
dc.description.tableofcontents |
Acknowledgements
摘要
Abstract
Contents
List of Figures
List of Tables
Chapter 1 Introduction
Chapter 2 Related Work
2.1 Human Action Recognition
2.1.1 Skeleton-based HAR
2.1.2 RGB and Optical-flow-based HAR
2.2 Animal Action Recognition
2.2.1 Overview of Existing Animal Action Datasets
2.2.2 Models for Animal Action Recognition
Chapter 3 Dataset: PetAction
3.1 Data Collection and Annotation
3.2 Action Taxonomy and Definitions
3.3 Dataset Statistics and Benchmark Protocol
3.4 Comparison with Existing Datasets
Chapter 4 Method
4.1 Video Stabilization
4.2 Feature Extraction
4.2.1 Object-Centric Crop and Isotropic Resize with Padding
4.2.2 Pose Estimation
4.2.3 Optical Flow Extraction
4.2.4 Local Flow Patch Sampling
4.3 Recognition: JOFF (Joint-Optical Flow Fusion)
4.3.1 Joint Stream and Local Flow Stream
4.3.2 I3DGCN Stream
Chapter 5 Experiment
5.1 Datasets
5.1.1 Source Datasets
5.1.2 Data Augmentation Strategy
5.2 Implementation Details
5.2.1 Environment and Training Configuration
5.2.2 Training Objective
5.2.3 Evaluation Metrics
5.3 Experimental Results
5.3.1 Quantitative Comparison
5.3.2 Qualitative Analysis
5.4 Ablation Study
5.4.1 Impact of Motion Representation
5.4.2 Effectiveness of Multi-Stream Integration
5.4.3 Hyperparameter Analysis
Chapter 6 Conclusion
References
| -
dc.language.iso | en | -
dc.subject | 寵物動作辨識 | -
dc.subject | 多模態學習 | -
dc.subject | 光流 | -
dc.subject | 圖卷積神經網路 | -
dc.subject | 異常行為偵測 | -
dc.subject | Pet Action Recognition | -
dc.subject | Multi-modal Learning | -
dc.subject | Optical Flow | -
dc.subject | Graph Convolutional Networks | -
dc.subject | Abnormal Behavior Detection | -
dc.title | 融合關節與光流特徵之多流寵物動作辨識架構 | zh_TW
dc.title | A Multi-Stream Framework Integrating Joint and Optical-Flow Representations for Pet Action Recognition | en
dc.type | Thesis | -
dc.date.schoolyear | 114-1 | -
dc.description.degree | 碩士 (Master's) | -
dc.contributor.oralexamcommittee | 葉家宏;葉梅珍 | zh_TW
dc.contributor.oralexamcommittee | Chia-Hung Yeh;Mei-Chen Yeh | en
dc.subject.keyword | 寵物動作辨識,多模態學習,光流,圖卷積神經網路,異常行為偵測 | zh_TW
dc.subject.keyword | Pet Action Recognition,Multi-modal Learning,Optical Flow,Graph Convolutional Networks,Abnormal Behavior Detection | en
dc.relation.page | 70 | -
dc.identifier.doi | 10.6342/NTU202600044 | -
dc.rights.note | 同意授權(全球公開) (Authorized: open access worldwide) | -
dc.date.accepted | 2026-01-15 | -
dc.contributor.author-college | 電機資訊學院 (College of Electrical Engineering and Computer Science) | -
dc.contributor.author-dept | 資訊工程學系 (Department of Computer Science and Information Engineering) | -
dc.date.embargo-lift | 2026-01-28 | -
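
The abstract above is the only technical description available in this record, but its three-stream design (a Joint Stream over skeleton keypoints, a Local Flow Stream over joint-centered flow patches, and an I3DGCN Stream for global temporal context) can be illustrated with a minimal sketch. The PyTorch module below is a hypothetical reconstruction under stated assumptions: the encoder layers, the 17-keypoint layout, the per-joint flow-descriptor size, the summation-based late fusion, and the class count are all placeholders chosen for illustration, not the thesis implementation.

```python
# Hypothetical sketch of a three-stream fusion head in the spirit of the
# JOFF (Joint-Optical Flow Fusion) module described in the abstract.
# Layer choices, feature sizes, and summation fusion are assumptions,
# NOT the thesis implementation.
import torch
import torch.nn as nn


class ThreeStreamFusion(nn.Module):
    def __init__(self, num_joints=17, patch_feat=32, global_feat=256,
                 dim=128, num_classes=13):
        super().__init__()
        # Joint Stream: per-frame encoding of sparse 2D skeleton coordinates.
        self.joint_enc = nn.Sequential(
            nn.Linear(num_joints * 2, dim), nn.ReLU(), nn.Linear(dim, dim))
        # Local Flow Stream: per-frame encoding of flow descriptors sampled
        # around each joint (flattened to one vector per frame here).
        self.flow_enc = nn.Sequential(
            nn.Linear(num_joints * patch_feat, dim), nn.ReLU(),
            nn.Linear(dim, dim))
        # Stand-in for the I3DGCN Stream: a temporal convolution over a
        # global per-frame clip feature, keeping coarse temporal context.
        self.global_enc = nn.Conv1d(global_feat, dim, kernel_size=3,
                                    padding=1)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, joints, flow_patches, clip_feat):
        # joints: (B, T, J, 2); flow_patches: (B, T, J, patch_feat);
        # clip_feat: (B, T, global_feat). Temporal mean pooling throughout.
        j = self.joint_enc(joints.flatten(2)).mean(dim=1)            # (B, dim)
        f = self.flow_enc(flow_patches.flatten(2)).mean(dim=1)       # (B, dim)
        g = self.global_enc(clip_feat.transpose(1, 2)).mean(dim=-1)  # (B, dim)
        return self.head(j + f + g)  # late fusion by summation (assumption)


if __name__ == "__main__":
    B, T, J = 2, 16, 17
    model = ThreeStreamFusion()
    logits = model(torch.randn(B, T, J, 2),    # skeleton joints (x, y)
                   torch.randn(B, T, J, 32),   # per-joint local-flow features
                   torch.randn(B, T, 256))     # global clip features
    print(logits.shape)  # torch.Size([2, 13])
```

Summation keeps the fused dimensionality fixed; concatenating the three stream features before the classifier head would be an equally plausible reading of "multi-stream fusion" from the abstract alone.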
Appears in Collections: 資訊工程學系

Files in This Item:
File | Size | Format
ntu-114-1.pdf | 160.53 MB | Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
