Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/71493
Title: | Low Discrepancy Adaptation with Weak Domain-specific Annotations for Efficient Indoor Scene Parsing |
Author: | Keng-Chi Liu (劉庚錡) |
Advisor: | Liang-Gee Chen (陳良基) |
Keywords: | Scene Parsing, Adaptation, Weak supervision, Domain discrepancy, Efficiency |
Publication Year: | 2019 |
Degree: | Master |
Abstract: | Developing autonomous mobile agents that can behave like humans based on their visual perception is a goal of artificial intelligence, and pixel-wise visual cues such as scene parsing are beneficial to such high-level applications. Significant improvements in these tasks have been made in recent years thanks to the evolution of deep learning. Nevertheless, in addition to accuracy, efficiency remains a major issue; the term "efficiency" here refers to both data collection and computational complexity. The remarkable scene parsing results achieved by supervised methods rely on numerous pixel-level annotations, which are time-consuming and expensive to obtain. Hence, alleviating this cumbersome manual effort becomes a crucial issue in the training procedure. Synthetic rendered data and weakly supervised methods have been explored to overcome this challenge; unfortunately, the former suffers from severe domain shift, and the latter from imprecise information. Moreover, the majority of existing weak-supervision research can only handle foreground salient "things".
Hence, to address this issue, we employ an auxiliary teacher-student learning framework to train such an untransferable task with pseudo ground truths, constructed by adapting auxiliary cues with lower domain discrepancy (e.g., depth) and leveraging domain-specific information (e.g., real appearance) in weak form. This imperfect information is then integrated effectively by a two-stage voting mechanism. From the inference-phase perspective, computational complexity has always been the main issue for edge computing. A typical network requires large run-time memory and 32-bit floating-point computation. Furthermore, unlike general classification networks with only a few category outputs, an hourglass network produces output with the same size and dimensions as its input, which costs more resources; yet most previous research has focused on classification networks. In this thesis, considering the practicality and necessity of real-world applications, our goal is to develop an efficient scene parsing algorithm with a focus on three objectives: labeling, complexity, and performance. First, we show that depth diminishes more of the domain discrepancy for indoor scenes when min-max normalization is introduced into the loss function. Additionally, we argue that the generator for real-to-sim reconstruction is capable of performing unsupervised sensor depth map restoration. Second, we propose a scene parsing framework that performs auxiliary teacher-student learning with depth adaptation as well as domain-specific weak-supervision information. We train a network with a loss function that penalizes predictions disagreeing with the highly confident pseudo ground truths provided by a two-stage integration mechanism, so as to produce more accurate segmentations. The proposed method outperforms the state-of-the-art adaptation method by 14.63% in terms of mean Intersection over Union (mIoU).
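The depth-adaptation idea above, removing absolute scale differences before computing the loss, can be sketched as follows. This is a minimal illustration, assuming a per-image min-max rescaling and an L1 penalty; the function names and the exact loss formulation are illustrative, not the thesis's actual implementation.

```python
import numpy as np

def minmax_normalize(depth, eps=1e-8):
    # Rescale a depth map to [0, 1] per image, discarding the absolute
    # range that differs between synthetic renderers and real sensors.
    d_min, d_max = depth.min(), depth.max()
    return (depth - d_min) / (d_max - d_min + eps)

def normalized_l1_loss(pred, target):
    # L1 penalty on min-max normalized maps: two depth maps that differ
    # only by an affine rescaling incur (near-)zero loss.
    return float(np.abs(minmax_normalize(pred) - minmax_normalize(target)).mean())

# Same scene geometry at two absolute scales (e.g., sim vs. real ranges):
sim = np.array([[1.0, 2.0], [3.0, 4.0]])
real = 10.0 * sim + 5.0  # affine rescaling of the same geometry
```

Because normalization cancels the affine scale difference, the two maps compare as equal, which is one plausible reading of how depth lowers the sim-to-real domain discrepancy.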
Lastly, we extend an existing method to quantize the target lightweight scene parsing network to ternary weights and low bit-width activations (3-4 bits), which reduces the model size by 21.9x and the activation size by 8.2x with only a 1.8% mIoU loss. |
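As a rough illustration of the ternary-weight quantization mentioned above, the sketch below ternarizes a weight tensor in the common ternary-weight-network style: zero out weights below a threshold set as a fraction of the mean absolute weight, and scale the surviving signs by a single factor. The 0.7 threshold factor and this particular scheme are assumptions for illustration; the thesis's actual scheme, and its reported 21.9x/8.2x figures, also depend on the storage encoding.

```python
import numpy as np

def ternarize(w, thresh_factor=0.7):
    # Map weights to {-alpha, 0, +alpha}: zero out small weights, then
    # scale the surviving signs by the mean magnitude of the kept weights.
    delta = thresh_factor * np.abs(w).mean()
    mask = np.abs(w) > delta
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return alpha * np.sign(w) * mask

w = np.array([0.9, -0.8, 0.05, -0.02])
q = ternarize(w)  # large weights kept (rescaled), small ones zeroed
```

Storing the ternary codes at 2 bits per weight plus one floating-point scale per layer gives roughly a 16x ideal reduction from 32-bit floats; the larger 21.9x reduction reported in the abstract presumably reflects the thesis's specific encoding.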
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/71493 |
DOI: | 10.6342/NTU201900349 |
Full-Text License: | Paid authorization |
Appears in Collections: | Graduate Institute of Electronics Engineering |
Files in This Item:
File | Size | Format
---|---|---
ntu-108-1.pdf (currently not authorized for public access) | 42.56 MB | Adobe PDF
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated in their license terms.