Text Spotting in Natural Scenes Based on Feature Pyramid Neural Network

劉國禎; Guo-Jhen Liou

Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91393

Title:	基於特徵金字塔神經網路應用於自然場景文字識別 Text Spotting in Natural Scenes Based on Feature Pyramid Neural Network
Authors:	劉國禎 Guo-Jhen Liou
Advisor:	黃乾綱 Chien-Kang Huang
Keyword:	computer vision, scene text, text detection, text recognition, convolutional neural network
Publication Year :	2022
Degree:	Master's
Abstract:	Scene text recognition (STR) has become a popular research area in computer vision because of its many applications, such as document digitization, image search, intelligent detection, and robot navigation. STR nevertheless remains highly challenging owing to complex image backgrounds, varied fonts, and imperfect imaging conditions. Early work relied mainly on hand-crafted features, which can limit recognition performance. With the rise of deep learning, neural networks have brought significant progress to STR. Although many model architectures have been proposed, the complexity and variability of natural scenes mean that no detection or recognition algorithm yet performs well across all scenarios. This study analyzes and evaluates model optimization strategies for text detection and text recognition separately. For detection, the EAST model serves as the baseline, and both its feature-extraction backbone and its mid-network feature-fusion stage (neck) are optimized. For recognition, the SRN model serves as the baseline, and its feature-extraction backbone is optimized to improve performance. For end-to-end integration, a text orientation classifier based on MobileNetV3 is trained so that vertical text can be recognized. Experimental results show that, compared with the original EAST model, the improved detector raises precision by 6.9%, recall by 2.3%, and F-measure by 4.6%. Compared with the original SRN model, the improved recognizer raises accuracy by 8.8% and improves the normalized edit distance by 9.7%. Finally, the two tasks are integrated into an end-to-end system that handles vertically written Chinese text, making the algorithm more practical.
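The evaluation metrics named in the abstract can be illustrated with a minimal sketch. This is not the thesis's evaluation code: the function names are hypothetical, and the normalized edit distance is assumed here to be reported as a similarity score (1 minus the Levenshtein distance divided by the longer string's length), which is one common convention and would explain why a higher value counts as an improvement.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_similarity(pred: str, truth: str) -> float:
    """Assumed convention: 1 - distance / max(len), so higher is better."""
    longest = max(len(pred), len(truth))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(pred, truth) / longest
```

Under this assumed convention, a prediction that differs from a four-character ground truth by one character scores 0.75, and a dataset-level score would be the mean over all samples.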
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91393
DOI:	10.6342/NTU202400011
Fulltext Rights:	Authorized (open access worldwide)
Appears in Collections:	Department of Engineering Science and Ocean Engineering

Files in This Item:

File	Size	Format
ntu-112-1.pdf	7.54 MB	Adobe PDF
