Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/71386

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 許永真(Yung-jen Hsu) | |
| dc.contributor.author | Pei-Ya Chiu | en |
| dc.contributor.author | 邱培雅 | zh_TW |
| dc.date.accessioned | 2021-06-17T05:59:52Z | - |
| dc.date.available | 2024-02-19 | |
| dc.date.copyright | 2019-02-19 | |
| dc.date.issued | 2019 | |
| dc.date.submitted | 2019-02-13 | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/71386 | - |
| dc.description.abstract | 機器生成影片描述能增進機器人與人類溝通的能力。而邊緣運算能提供更快的回應速度和保護隱私的節點運算。若能在邊緣設備上運行「機器生成影片描述」,將可以使智能機器人與人類進行即時互動,同時減少上傳原始資料到雲端伺服器所造成的隱私問題。然而,類神經網路所需的大量計算資源,會導致影片描述很難運行在邊際設備上。而先前的文獻也缺乏相關研究。本論文旨在解決影片描述在邊緣設備上的運算量問題。
為了找到現行的運算瓶頸,我們分析了影片描述模型的架構以及所需運算量。從分析結果來看,影片描述模型的運算瓶頸是卷積網路特徵提取。為了解決這個問題,我們提出了一種名為「多模態槓桿」的方法。此方法使用較小的卷積網路特徵提取器,並透過其他低計算量特徵資料(如音頻特徵或其他描述資料)來補償預期的準確度損失。我們的結果顯示,多模態槓桿可以在降低計算負荷 92% 的同時,只讓 METEOR 分數下降 4%。而在邊緣設備所需的運算時間也減少了 30%。 | zh_TW |
| dc.description.abstract | Video description is important because it enables smart robots to interact with humans. At the same time, edge computing provides fast round trips and privacy-preserving computation. Combining video description with edge computing decreases the response time of smart robots and reduces the privacy concerns of transmitting raw data to cloud servers. However, the heavy computational load makes running video description on edge devices difficult, and such applications have received little attention in the literature. In this thesis, we investigate the computational problem of video description on edge devices.
To find the computational bottleneck, we first analyze a CNN-based video description method; the bottleneck turns out to be CNN feature extraction. To address it, we use an approach called “multi-modal leveraging”, which employs a small CNN feature extractor and compensates for the expected accuracy loss with other low-cost features such as audio features or metadata. Our experiments show that the proposed method reduces inference time by 30% and computational load by 92%, with a 4% drop in METEOR score. (An illustrative sketch of this multi-modal encoder idea appears below the metadata table.) | en |
| dc.description.provenance | Made available in DSpace on 2021-06-17T05:59:52Z (GMT). No. of bitstreams: 1 ntu-108-P05922002-1.pdf: 708662 bytes, checksum: 22bdb42d44f0c6c64ece780ee7959a89 (MD5) Previous issue date: 2019 | en |
| dc.description.tableofcontents | Chapter 1 Introduction p.1
1.1 Background and Motivation p.1
1.2 Thesis Objective p.2
1.3 Thesis Organization p.3
Chapter 2 Related Work p.5
2.1 Video Description p.5
2.2 Multi-modal Video Representation p.6
2.3 Edge AI p.7
Chapter 3 Problem Statement p.9
3.1 Notations p.9
3.2 Problem Definition p.10
3.3 Symbols Table p.10
Chapter 4 Methodology p.13
4.1 Video Description p.13
4.1.1 Feature Extraction (Encoder) p.14
4.1.2 Language Model (Decoder) p.16
4.2 Computational Load Observations p.16
4.3 Multi-modal Leveraging Approach (Encoder) p.17
4.3.1 Pre-trained CNN Models p.18
4.3.2 Multi-modal Feature Selection p.19
Chapter 5 Experiments p.21
5.1 Experiments Setup p.21
5.1.1 Dataset p.21
5.1.2 Compared Multi-modal Features p.22
5.1.3 Hyperparameters and Implementation p.23
5.1.4 Metrics p.24
5.2 Results p.25
5.2.1 Computational Load Evaluation p.26
5.2.2 Sentence Quality Evaluation p.26
5.2.3 Inference Time Evaluation p.27
5.3 Discussions p.29
5.3.1 The Effects of Reducing FLOPs p.29
5.3.2 Overall Scores vs. Single Sentence p.29
5.3.3 The Influence of Beam Width Selection p.30
Chapter 6 Conclusion p.33
6.1 Summary and Contributions p.33
6.2 Future Work p.34
Bibliography p.35 | |
| dc.language.iso | en | |
| dc.subject | 多元特徵 | zh_TW |
| dc.subject | 影片描述 | zh_TW |
| dc.subject | 邊緣運算 | zh_TW |
| dc.subject | Multi-modal | en |
| dc.subject | Edge Computing | en |
| dc.subject | Video Description | en |
| dc.title | 以多元特徵方法優化邊際設備上的影片描述 | zh_TW |
| dc.title | Edge-friendly Video Description by Leveraging Multi-modal Features | en |
| dc.type | Thesis | |
| dc.date.schoolyear | 107-1 | |
| dc.description.degree | 碩士 | |
| dc.contributor.oralexamcommittee | 陳維超(Wei-Chao Chen),李宏毅(Hung-yi Lee),古倫維(Lun-Wei Ku),劉昭麟(Chao-Lin Liu) | |
| dc.subject.keyword | 多元特徵,邊緣運算,影片描述, | zh_TW |
| dc.subject.keyword | Multi-modal,Edge Computing,Video Description, | en |
| dc.relation.page | 38 | |
| dc.identifier.doi | 10.6342/NTU201900506 | |
| dc.rights.note | 有償授權 | |
| dc.date.accepted | 2019-02-13 | |
| dc.contributor.author-college | 電機資訊學院 | zh_TW |
| dc.contributor.author-dept | 資訊工程學研究所 | zh_TW |
| Appears in Collections: | Department of Computer Science and Information Engineering | |
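The abstract above describes the multi-modal leveraging idea only at a high level. Below is a minimal, hypothetical PyTorch sketch of that idea: per-frame features from a small CNN are fused with low-cost audio features before a sequence decoder. All module names, feature dimensions, and the fusion and decoder choices here are assumptions made for illustration; they are not taken from the thesis.

```python
# Hypothetical sketch only -- not the thesis implementation.
import torch
import torch.nn as nn


class MultiModalEncoder(nn.Module):
    """Fuse per-frame features from a small CNN with low-cost audio features."""

    def __init__(self, cnn_dim=1280, audio_dim=128, hidden_dim=512):
        super().__init__()
        # Project each modality into a shared space, then fuse by concatenation.
        self.cnn_proj = nn.Linear(cnn_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, cnn_feats, audio_feats):
        # cnn_feats:   (batch, frames, cnn_dim)   e.g. from a MobileNet-sized backbone
        # audio_feats: (batch, frames, audio_dim) e.g. MFCC-style features
        fused = torch.cat(
            [self.cnn_proj(cnn_feats), self.audio_proj(audio_feats)], dim=-1
        )
        return torch.relu(self.fuse(fused))  # (batch, frames, hidden_dim)


class CaptionDecoder(nn.Module):
    """Condition an LSTM language model on the fused video sequence."""

    def __init__(self, vocab_size=10000, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, enc_seq, captions):
        # Run the fused frames through the LSTM, keep the final state,
        # then decode the caption tokens conditioned on that state.
        _, state = self.lstm(enc_seq)
        dec_out, _ = self.lstm(self.embed(captions), state)
        return self.out(dec_out)  # (batch, words, vocab_size)


if __name__ == "__main__":
    # Random tensors stand in for real extracted features.
    encoder, decoder = MultiModalEncoder(), CaptionDecoder()
    video = torch.randn(2, 30, 1280)           # 30 frames of small-CNN features
    audio = torch.randn(2, 30, 128)            # matching audio features
    tokens = torch.randint(0, 10000, (2, 12))  # partial caption as word indices
    logits = decoder(encoder(video, audio), tokens)
    print(logits.shape)                        # torch.Size([2, 12, 10000])
```

Concatenation followed by a linear projection is only one plausible fusion strategy; the thesis's actual encoder, hyperparameters, and decoder may differ.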
Files in this item:
| File | Size | Format | |
|---|---|---|---|
| ntu-108-1.pdf (not authorized for public access) | 692.05 kB | Adobe PDF | |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.