Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90176
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 郭大維 | zh_TW |
dc.contributor.advisor | Tei-Wei Kuo | en |
dc.contributor.author | 郭宗翰 | zh_TW |
dc.contributor.author | Tsung-Han Kuo | en |
dc.date.accessioned | 2023-09-22T17:43:54Z | - |
dc.date.available | 2023-11-10 | - |
dc.date.copyright | 2023-09-22 | - |
dc.date.issued | 2023 | - |
dc.date.submitted | 2023-08-11 | - |
dc.identifier.citation | D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985.
S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In International Conference on Machine Learning, pages 224–232. PMLR, 2017.
M. E. Benson, M. Reichelderfer, A. Said, E. A. Gaumnitz, and P. R. Pfau. Variation in colonoscopic technique and adenoma detection rates at an academic gastroenterology unit. Digestive Diseases and Sciences, 55(1):166, 2010.
H. Borgli, V. Thambawita, P. H. Smedsrud, S. Hicks, D. Jha, S. L. Eskeland, K. R. Randel, K. Pogorelov, M. Lux, D. T. D. Nguyen, D. Johansen, C. Griwodz, H. K. Stensland, E. Garcia-Ceja, P. T. Schmidt, H. L. Hammer, M. A. Riegler, P. Halvorsen, and T. de Lange. HyperKvasir: A comprehensive multi-class image and video dataset for gastrointestinal endoscopy, Dec 2019.
L. Breiman. Random forests. Machine Learning, 45:5–32, 2001.
K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
P. Dhariwal and A. Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
D. O. Faigel, I. M. Pike, T. H. Baron, A. Chak, J. Cohen, S. E. Deal, B. Hoffman, B. C. Jacobson, K. Mergener, B. T. Petersen, et al. Quality indicators for gastrointestinal endoscopic procedures: an introduction. Gastrointestinal Endoscopy, 63(4):S3–S9, 2006.
H. Fang, W. Deng, Y. Zhong, and J. Hu. Triple-GAN: Progressive face aging with triple translation loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 804–805, 2020.
D. Gong, L. Wu, J. Zhang, G. Mu, L. Shen, J. Liu, Z. Wang, W. Zhou, P. An, X. Huang, et al. Detection of colorectal adenomas with a real-time computer-aided system (ENDOANGEL): a randomised controlled study. The Lancet Gastroenterology & Hepatology, 2020.
S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
N. Huang, F. Tang, W. Dong, and C. Xu. Draw your art dream: Diverse digital art synthesis with multimodal guided diffusion. In Proceedings of the 30th ACM International Conference on Multimedia, pages 1085–1094, 2022.
Z. Huang, S. Chen, J. Zhang, and H. Shan. PFA-GAN: Progressive face aging with generative adversarial network. IEEE Transactions on Information Forensics and Security, 16:2031–2045, 2020.
I. Kemelmacher-Shlizerman, S. Suwajanakorn, and S. M. Seitz. Illumination-aware age progression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3334–3341, 2014.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
S. Liu, Y. Sun, D. Zhu, R. Bao, W. Wang, X. Shu, and S. Yan. Face aging with contextual generative adversarial nets. In Proceedings of the 25th ACM International Conference on Multimedia, pages 82–90, 2017.
X. Liu, D. H. Park, S. Azadi, G. Zhang, A. Chopikyan, Y. Hu, H. Shi, A. Rohrbach, and T. Darrell. More control for free! Image synthesis with semantic diffusion guidance. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 289–299, 2023.
C. Lugaresi, J. Tang, H. Nash, C. McClanahan, E. Uboweja, M. Hays, F. Zhang, C.-L. Chang, M. G. Yong, J. Lee, et al. MediaPipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172, 2019.
J. Masci, U. Meier, D. Cireşan, and J. Schmidhuber. Stacked convolutional autoencoders for hierarchical feature extraction. In International Conference on Artificial Neural Networks, pages 52–59. Springer, 2011.
S. Palsson, E. Agustsson, R. Timofte, and L. Van Gool. Generative adversarial style transfer networks for face aging. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2084–2092, 2018.
J. P. Pluim, J. A. Maintz, and M. A. Viergever. Mutual-information-based registration of medical images: a survey. IEEE Transactions on Medical Imaging, 22(8):986–1004, 2003.
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
A. Ratner, S. Bach, P. Varma, and C. Ré. Weak supervision: the new programming paradigm for machine learning. Hazy Research. Available via https://dawn.cs.stanford.edu/2017/07/16/weak-supervision/. Accessed 05–09, 2019.
K. Ricanek and T. Tesafaye. MORPH: A longitudinal image database of normal adult age-progression. In 7th International Conference on Automatic Face and Gesture Recognition (FGR06), pages 341–345. IEEE, 2006.
E. Richardson and Y. Weiss. On GANs and GMMs. Advances in Neural Information Processing Systems, 31, 2018.
T. Ross, D. Zimmerer, A. Vemuri, F. Isensee, M. Wiesenfarth, S. Bodenstedt, F. Both, P. Kessler, M. Wagner, B. Müller, et al. Exploiting the potential of unlabeled endoscopic video data with self-supervised learning. International Journal of Computer Assisted Radiology and Surgery, 13(6):925–933, 2018.
C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948.
A. Shaukat, T. S. Rector, T. R. Church, F. A. Lederle, A. S. Kim, J. M. Rank, and J. I. Allen. Longer withdrawal time is associated with a reduced incidence of interval cancer after screening colonoscopy. Gastroenterology, 149(4):952–957, 2015.
R. L. Siegel, K. D. Miller, A. Goding Sauer, S. A. Fedewa, L. F. Butterly, J. C. Anderson, A. Cercek, R. A. Smith, and A. Jemal. Colorectal cancer statistics, 2020. CA: A Cancer Journal for Clinicians, 2020.
J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
J. Suo, X. Chen, S. Shan, W. Gao, and Q. Dai. A concatenational graph evolution aging model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2083–2096, 2012.
J. Suo, S.-C. Zhu, S. Shan, and X. Chen. A compositional and dynamic model for face aging. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(3):385–401, 2009.
B. Tiddeman, M. Burt, and D. Perrett. Prototyping and transforming facial textures for perception research. IEEE Computer Graphics and Applications, 21(5):42–50, 2001.
G. Urban, P. Tripathi, T. Alkayali, M. Mittal, F. Jalali, W. Karnes, and P. Baldi. Deep learning localizes and identifies polyps in real time with 96% accuracy in screening colonoscopy. Gastroenterology, 155(4):1069–1078, 2018.
R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017. | - |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90176 | - |
dc.description.abstract | 隨著 2012 年深度學習的起飛,基於計算機視覺的 AI 系統顯著的提高了辨識和檢測能力。 然而, AI 系統非常依賴經過標註的訓練數據,而與數據標註相關的可觀成本很可能會阻礙 AI 系統的發展。 因此,針對監督式學習中訓練數據不足的問題,本文探索了自動標註和數據合成方法。 在自動標註方面,我們借力深度學習技術,根據可疑結腸粘膜特徵與相對應的結腸鏡篩檢速度之間的相關性標註了需要進一步檢查的視頻圖像。 在數據合成方面,我們為一個更大型的生成網路提出自下而上的訓練方法,並評估其在面部老化圖像合成方面的有效性。 具體來說,這個更大型的網路由兩個 CycleGAN 所串聯而成,用於合成面部老化和年輕化圖像,我們稱為 BiTrackGAN。 自下而上的訓練在這兩個 CycleGAN 之間引入了一個理想的中間狀態,即約束機制。 結果表明,BiTrackGAN 通過約束機制合成出更平滑自然的面部老化和年輕化圖像。 此外,它還提高了多樣性地合成效果。 更進一步的,我們提出了一種名為 KFS 的分類器來評估跨年齡面部特徵相似性,作用在引導擴散生成模型中合成情境式面部老化圖像。 就技術而言,KFS 所評估的面部特徵相似性指標被用來當作計算與參考圖像相似性損失的正則化項,使得透過多模態擴散引導所合成的情境式面部老化圖像更加接近情境文字提示,效果也更加協調。 據我們所知,我們是第一個提出情境式面部老化的研究。 我們認為這將會是面部老化和任何與生長或老化相關的圖像合成問題的新穎流行技術,包括醫學領域的腫瘤圖像合成。
簡言之,我們所提出的方法有能力取得或者合成出最新的、最貼近實際的、且多樣化的數據用來進行模型訓練,而不僅僅是大量的數據。 | zh_TW |
dc.description.abstract | With the take-off of deep learning in 2012, AI systems based on computer vision have significantly improved recognition and detection capabilities. However, AI development still relies heavily on labeled training data, and the considerable cost of data labeling can hinder its advancement. To address the problem of insufficient training data in supervised learning, this dissertation explores automatic data labeling and data synthesis. For automatic data labeling, we use deep learning techniques to automatically label video frames that may require further examination, based on the correlation between suspicious colonic mucosal features and the corresponding withdrawal speed during colonoscopy screening. For data synthesis, we propose a bottom-up training method for a larger generative network and evaluate its effectiveness on facial aging image synthesis. Specifically, the network, named BiTrackGAN, is a translation pipeline of two cascaded CycleGAN blocks that synthesizes facial aging and rejuvenation images. Bottom-up training induces an ideal intermediate state between the two CycleGAN blocks, which we call the constraint mechanism. The results show that, through the constraint mechanism, BiTrackGAN synthesizes smoother and more natural facial aging and rejuvenation images while also improving synthesis diversity. Furthermore, we propose a classifier named KFS that evaluates cross-age facial feature similarity, and we use it to synthesize contextual facial aging images in a guided diffusion model. Technically, the KFS similarity metric serves as a regularization term in the reference-image loss, so that contextual aged faces synthesized under multimodal diffusion guidance follow the contextual text prompt more closely and appear more coordinated. To the best of our knowledge, we are the first to propose and study contextual facial aging synthesis. We believe this novel technique will apply broadly to facial aging and to other growth- or aging-related image synthesis problems, including tumor image synthesis in the medical field.
In summary, our methods can obtain or synthesize up-to-date, realistic, and diverse data for model training, rather than merely large quantities of data. | en |
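The abstract's regularization idea can be sketched in simplified form. The following is a minimal illustration, not the dissertation's actual implementation: `kfs_similarity` stands in for the KFS classifier's cross-age similarity score (the real KFS is a random-forest model over facial features, per the table of contents), and the weight `lam` and the function names are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two feature vectors, in [-1, 1].
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def kfs_similarity(feat_synth, feat_ref):
    # Stand-in for the KFS cross-age facial-feature similarity score,
    # here approximated by cosine similarity remapped to [0, 1].
    return 0.5 * (cosine_similarity(feat_synth, feat_ref) + 1.0)

def guidance_loss(text_loss, image_loss, feat_synth, feat_ref, lam=0.1):
    # Multimodal guidance objective: text-prompt loss plus reference-image
    # loss, regularized by (1 - KFS similarity) so that the synthesized
    # face is pulled toward the reference identity's cross-age features.
    reg = 1.0 - kfs_similarity(feat_synth, feat_ref)
    return text_loss + image_loss + lam * reg

# Usage: identical features incur no penalty; dissimilar features raise the loss.
feat = np.array([1.0, 2.0, 3.0])
base = guidance_loss(0.5, 0.2, feat, feat)       # regularizer is zero here
worse = guidance_loss(0.5, 0.2, feat, -feat)     # maximal dissimilarity penalty
```

The design point is only that the similarity score enters the objective as an additive penalty, which is what lets the same diffusion-guidance machinery trade off text-prompt adherence against identity preservation.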
dc.description.provenance | Submitted by admin ntu (admin@lib.ntu.edu.tw) on 2023-09-22T17:43:54Z No. of bitstreams: 0 | en |
dc.description.provenance | Made available in DSpace on 2023-09-22T17:43:54Z (GMT). No. of bitstreams: 0 | en |
dc.description.tableofcontents | Acknowledgements 3
摘要 5
Abstract 7
Contents 9
List of Figures 15
List of Tables 19
Denotation 21
Chapter 1 Introduction 1
1.1 Background, Motivations, and Objectives 1
1.2 Contribution 3
Chapter 2 Automatic Colonoscopy Labeling 5
2.1 Introduction 5
2.2 Related Works 6
2.2.1 AI-assisted System for the Colonoscopy 6
2.2.2 Self-Supervised Learning 7
2.2.3 Withdrawal Speed Control 8
2.3 Preliminary 8
2.3.1 Auto-Encoder 9
2.3.1.1 Convolutional Neural Networks (CNNs) 9
2.3.1.2 Stacked Convolutional Auto-Encoders (CAES) 10
2.3.2 Mutual Information 10
2.3.3 Recurrent Neural Networks (RNNs) 11
2.4 Automatic Colonoscopy Labeling by Self-Supervised Learning from Withdrawal Speed 14
2.4.1 Features Extraction 14
2.4.2 Moving Average Similarity Evaluation 16
2.4.2.1 Formulation of the Training Dataset 18
2.4.2.2 Regression Model Training 18
2.5 Experiments 21
2.5.1 System Setup 21
2.5.2 Dataset 21
2.5.3 Implementations 22
2.5.3.1 Auto-Encoder 22
2.5.3.2 LSTM 22
2.5.4 Performance 22
2.5.4.1 Evaluation Metrics 23
2.5.4.2 Cross-Validation 24
2.5.4.3 Quantitative Comparison 24
2.5.4.4 Qualitative Comparison 25
2.5.4.5 Automatic Labeling Results 27
2.6 Conclusion 29
Chapter 3 Constraint Facial Aging Image Synthesis 31
3.1 Introduction 31
3.2 Related Works 34
3.2.1 Conventional Facial Aging Synthesis Approaches 34
3.2.2 Deep Learning Based Facial Aging Synthesis 34
3.2.3 Mode Collapse Evaluation 35
3.3 BiTrackGAN 36
3.3.1 Network Architecture 36
3.3.2 CycleGAN 37
3.3.3 Bottom-Up Training 38
3.3.4 Constraint Mechanism 38
3.4 Experiments 40
3.4.1 System Setup 40
3.4.2 Data Preparation 41
3.4.3 Implementation Details 41
3.4.4 Qualitative Comparison 43
3.4.4.1 Reasonable Aging Effect with Constraint Mechanism 43
3.4.4.2 Diversity 46
3.4.4.3 Analysis of Loss Curve 46
3.4.5 Quantitative Analysis 48
3.5 Conclusion 50
Chapter 4 Contextual Facial Aging Image Synthesis 51
4.1 Introduction 51
4.2 Related Works 54
4.3 Preliminary 56
4.3.1 Random Forest 56
4.3.2 Similarity Evaluations 57
4.3.2.1 LPIPS 58
4.3.2.2 MediaPipe Face Landmarker 59
4.4 Contextual Facial Aging Image Synthesis with Fine-Grained Multimodal Diffusion Guidance 59
4.4.1 Facial Features Extraction and Similarity Evaluation 60
4.4.1.1 Metrics 61
4.4.1.2 Similarity Evaluation using KFS (Random Forest Model) 61
4.4.2 Semantic Diffusion Guidance 64
4.5 Experiments 65
4.5.1 System Setup 65
4.5.2 Data Preparation 66
4.5.3 Train Random Forest Model 67
4.5.4 Time for Model Training and Execution 68
4.5.5 Comparison of Classification Performance 70
4.5.6 Ablation Study 71
4.5.7 User Study 73
4.6 Conclusion 74
Chapter 5 Conclusion 75
5.1 Conclusion 75
5.1.1 Automatic Colonoscopy Labeling 75
5.1.2 Constraint Facial Aging Image Synthesis 77
5.1.3 Contextual Facial Aging Image Synthesis 78
5.2 Future Works 79
5.2.1 Automatic Colonoscopy Labeling 79
5.2.2 Constraint Facial Aging Image Synthesis 79
5.2.3 Contextual Facial Aging Image Synthesis 79
References 81 | - |
dc.language.iso | en | - |
dc.title | 情境式影像合成以多模態引導方法 | zh_TW |
dc.title | Contextual Image Synthesis with Multimodal Guidance | en |
dc.type | Thesis | - |
dc.date.schoolyear | 111-2 | - |
dc.description.degree | 博士 | - |
dc.contributor.coadvisor | 胡京通 | zh_TW |
dc.contributor.coadvisor | Jingtong Hu | en |
dc.contributor.oralexamcommittee | 顏嗣鈞;徐慰中;施吉昇;洪士灝 | zh_TW |
dc.contributor.oralexamcommittee | Hsu-Chun Yen;Wei-Chung Hsu;Chi-Sheng Shih;Shih-Hao Hung | en |
dc.subject.keyword | 自動標註,情境式影像合成,多模態引導 | zh_TW |
dc.subject.keyword | Automatic Labeling,Contextual Image Synthesis,Multimodal Guidance | en |
dc.relation.page | 85 | - |
dc.identifier.doi | 10.6342/NTU202303822 | - |
dc.rights.note | 未授權 | - |
dc.date.accepted | 2023-08-12 | - |
dc.contributor.author-college | 電機資訊學院 | - |
dc.contributor.author-dept | 資訊網路與多媒體研究所 | - |
Appears in Collections: | 資訊網路與多媒體研究所
Files in This Item:
File | Size | Format |
---|---|---|
ntu-111-2.pdf (currently not authorized for public access) | 21.79 MB | Adobe PDF |
All items in this repository are protected by copyright, with all rights reserved, unless their copyright terms are otherwise specified.