Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91741
Title: | EffSegmentNet: An Efficient Deep Learning Model for Real-time Semantic Segmentation |
Author: | Cyun-Bo Wang |
Advisor: | Jian-Jiun Ding |
Keywords: | semantic segmentation, convolution, encoder, decoder, multi-head self-attention, feedforward neural network |
Publication Year: | 2023 |
Degree: | Master |
Abstract: | Semantic segmentation is an important task in the field of computer vision, where each pixel in an image is classified in order to segment the image into different classes. For example, in an image containing three cats and two dogs, semantic segmentation labels all pixels belonging to the cats as the "cat" class and all pixels belonging to the dogs as the "dog" class. Semantic segmentation has wide-ranging applications in areas such as autonomous driving, medical imaging, and industrial inspection. Currently, deep learning methods are the dominant approach to semantic segmentation. The typical deep learning architecture for semantic segmentation consists of an encoder and a decoder.
The encoder is responsible for extracting information from the image, while the decoder uses the extracted information to generate the segmentation result. Early semantic segmentation models were composed of numerous convolutional layers, which effectively capture important local information in the image. Consequently, convolution-based deep learning architectures achieved great success in semantic segmentation, and numerous studies emerged in this field. However, convolution-based architectures struggle to capture global information because of their limited receptive fields. In 2020, the Transformer architecture, originally developed for natural language processing, began to be applied to computer vision. The advantage of Transformer-based architectures over convolution-based models is their ability to capture global image information more effectively. As a result, Transformer-based architectures have achieved significant success in computer vision, including semantic segmentation. The Transformer encoder consists of multiple blocks, each containing a multi-head self-attention module and a feedforward neural network. Multi-head self-attention enables tokens, which represent 2D image patches, to communicate with one another and capture global image information. The feedforward neural network then exchanges information across channels based on the global information obtained from self-attention. Together, these components extract rich image information. However, multi-head self-attention is computationally expensive: its complexity is O(N^2), where N is the number of tokens. To address this issue, researchers have proposed methods that reduce the complexity of self-attention to O(N), with promising results.
Another study suggests that the key to the success of Transformers lies in the overall structure of the block, which lets tokens interact with each other and exchange information across channels. The attention mechanism may therefore be unnecessary: as long as the token mixer is appropriately designed to blend information among tokens, satisfactory results can be achieved. Architectures composed of such blocks are referred to as MetaFormers. In this thesis, we propose a lightweight real-time semantic segmentation model that balances accuracy and inference speed. The carefully designed encoder follows a hierarchical structure, extracting image information at different scales, including fine-grained spatial details and long-range contextual information. Following the MetaFormer concept, our token mixer design is not limited to the attention mechanism. To capture local information, we use convolutional token mixers to reduce computation time. To extract global information efficiently, we propose a global token mixer that combines a feedforward neural network with an attention module. In the feedforward network, tokens are grouped so that tokens within each group can communicate effectively; an attention module then enables communication between groups to obtain global information. In the decoder, we design a lightweight convolutional structure that aggregates the spatial and contextual information extracted by the encoder to produce the segmentation result. |
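To make the quadratic cost mentioned in the abstract concrete, below is a minimal NumPy sketch of plain (single-head) self-attention. It is illustrative only, not the thesis's implementation; the function name and shapes are assumptions. The N x N score matrix is the source of the O(N^2) complexity that the efficient-attention variants discussed above try to avoid.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention over N tokens of dimension C.
    The (N, N) score matrix makes the cost quadratic in N."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])  # shape (N, N): O(N^2)
    return softmax(scores) @ v               # shape (N, C)

rng = np.random.default_rng(0)
N, C = 16, 8                                 # 16 tokens, 8 channels
x = rng.standard_normal((N, C))
W = [rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3)]
out = self_attention(x, *W)
print(out.shape)                             # (16, 8)
```

Doubling N quadruples the size of `scores`, which is why reducing this term to O(N) matters for real-time segmentation of high-resolution images.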
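The grouped global token mixer described in the abstract can be sketched as follows. This is a hedged illustration of the idea only (mix tokens within each group, then let per-group summaries attend to one another so cross-group cost is quadratic in the number of groups rather than in N); all names, weight shapes, and the mean-pooled group summary are assumptions, not the thesis's actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_token_mixer(x, num_groups, rng):
    """Hypothetical sketch: mix tokens within groups, then exchange
    information between groups via attention over group summaries."""
    N, C = x.shape
    g = N // num_groups                      # tokens per group
    groups = x.reshape(num_groups, g, C)

    # 1) Within-group mixing: a token-mixing layer along the token
    #    axis, with weights shared across groups.
    W_tok = rng.standard_normal((g, g)) / np.sqrt(g)
    mixed = np.einsum('ij,bjc->bic', W_tok, groups)

    # 2) Cross-group attention: one summary vector per group attends
    #    to all others (num_groups^2 cost instead of N^2).
    summaries = mixed.mean(axis=1)           # (num_groups, C)
    attn = softmax(summaries @ summaries.T / np.sqrt(C))
    global_ctx = attn @ summaries            # (num_groups, C)

    # Broadcast the global context back to every token in its group.
    out = mixed + global_ctx[:, None, :]
    return out.reshape(N, C)

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 32))            # 64 tokens, 32 channels
y = global_token_mixer(x, num_groups=8, rng=rng)
print(y.shape)                               # (64, 32)
```

The design point this sketch illustrates: attention is applied only to `num_groups` summaries, so every token still receives global context while the attention cost no longer scales with N^2.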
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91741 |
DOI: | 10.6342/NTU202400327 |
Full-text Authorization: | Authorized (worldwide public access) |
Appears in Collections: | Graduate Institute of Communication Engineering |
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-112-1.pdf | 9.88 MB | Adobe PDF | View/Open |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.