NTU Theses and Dissertations Repository › College of Electrical Engineering and Computer Science › Department of Computer Science and Information Engineering
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/15325
Full metadata record
dc.contributor.advisor: 傅楸善 (Chiou-Shann Fuh)
dc.contributor.author: Ping-Yuan Tseng [en]
dc.contributor.author: 曾柄元 [zh_TW]
dc.date.accessioned: 2021-06-07T17:32:55Z
dc.date.copyright: 2020-08-03
dc.date.issued: 2020
dc.date.submitted: 2020-07-14
dc.identifier.citation:
[1] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang, “EAST: An Efficient and Accurate Scene Text Detector,” https://arxiv.org/pdf/1704.03155.pdf, 2017.
[2] M. Liao, B. Shi, and X. Bai, “TextBoxes++: A Single-Shot Oriented Scene Text Detector,” https://arxiv.org/pdf/1801.02765.pdf, 2018.
[3] Z. Tian, W. Huang, T. He, P. He, and Y. Qiao, “Detecting Text in Natural Image with Connectionist Text Proposal Network,” https://arxiv.org/pdf/1609.03605.pdf, 2016.
[4] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single Shot MultiBox Detector,” https://arxiv.org/pdf/1512.02325.pdf, 2016.
[5] B. Shi, X. Bai, and C. Yao, “An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition,” https://arxiv.org/pdf/1507.05717.pdf, 2015.
[6] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” https://www.bioinf.jku.at/publications/older/2604.pdf, 1997.
[7] F. Borisyuk, A. Gordo, and V. Sivakumar, “Rosetta: Large Scale System for Text Detection and Recognition in Images,” https://research.fb.com/wp-content/uploads/2018/10/Rosetta-Large-scale-system-for-text-detection-and-recognition-in-images.pdf, 2018.
[8] S. Q. Ren, K. M. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” https://arxiv.org/pdf/1506.01497.pdf, 2016.
[9] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” https://arxiv.org/pdf/1505.04597.pdf, 2015.
[10] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs,” https://arxiv.org/pdf/1606.00915.pdf, 2017.
[11] ICDAR, “ICDAR 2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition,” https://rrc.cvc.uab.es/?ch=15, 2019.
[12] Alibaba Cloud, “ICPR MTWI 2018 Challenge,” https://tianchi.aliyun.com/competition/entrance/231651/introduction, 2018.
[13] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” https://arxiv.org/pdf/1506.02640.pdf, 2016.
[14] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” https://arxiv.org/pdf/1409.1556.pdf, 2015.
[15] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, pp. 770-778, 2016.
[16] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2012 Results,” http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html, 2012.
[17] T. Y. Lin, M. Maire, S. Belongie, et al., “Microsoft COCO: Common Objects in Context,” https://arxiv.org/pdf/1405.0312.pdf, 2015.
[18] uoip, “SSD Variants,” https://github.com/uoip/SSD-variants, 2018.
[19] Tommy Huang, “Machine/Deep Learning: Object Detection Non-Maximum Suppression (NMS),” https://medium.com/@chih.sheng.huang821/%E6%A9%9F%E5%99%A8-%E6%B7%B1%E5%BA%A6%E5%AD%B8%E7%BF%92-%E7%89%A9%E4%BB%B6%E5%81%B5%E6%B8%AC-non-maximum-suppression-nms-aa70c45adffa, 2018.
[20] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks,” https://www.cs.toronto.edu/~graves/icml_2006.pdf, 2006.
[21] T. L. Yuan, Z. Zhu, K. Xu, C. J. Li, T. J. Mu, and S. M. Hu, “A Large Chinese Text Dataset in the Wild,” https://ctwdataset.github.io/, 2018.
[22] Wikipedia, “Edit distance,” https://en.wikipedia.org/wiki/Edit_distance, 2020.
[23] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, “YOLACT++: Better Real-time Instance Segmentation,” https://arxiv.org/pdf/1912.06218.pdf, 2019.
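Reference [22] above (edit distance) underlies the thesis's recognition evaluation criterion (Section 6.7 of the table of contents). As an illustration only, not code from the thesis, the standard Levenshtein dynamic program can be sketched as:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute ca with cb
        prev = cur
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # 3
```

Normalizing this distance by the ground-truth length gives the character error rate commonly reported for text-line recognizers.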
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/15325
dc.description.abstract: This thesis proposes a deep learning solution for optical character recognition on labels, divided into two main parts. First, text-line detection: we propose two detection models that take very different approaches. The first is an image segmentation-based text-line detection model called Water Pixel, which predicts two segmentation maps, one with the text at its normal size and one with the text shrunk, and then applies the watershed algorithm to resolve touching text, achieving the effect of instance segmentation. The second detection model is CT-SSD. Among the common deep learning text detection algorithms, we adopt the idea of the Connectionist Text Proposal Network (CTPN), which slices the ground-truth text boxes of the training data into small pieces for training; this solves the problem that other algorithms handle long text poorly, but CTPN is not fast enough for our needs, so we use the Single-Shot multibox Detector (SSD) as the detection model instead. Detecting text boxes on a GPU (Graphics Processing Unit) achieves near real-time speed and suits long text very well. Second, text-line recognition: there is no publicly available real traditional Chinese text-line dataset, so this thesis trains a Convolutional Recurrent Neural Network (CRNN) on synthetic traditional Chinese text lines, and alternately trains the CRNN encoder with the simplified Chinese in-the-wild dataset CTW (Chinese Text dataset in the Wild) to overcome the overfitting caused by training the CRNN on synthetic data alone. We also use many forms of data augmentation, such as blurring, stretching, distortion, background variation, jitter, and aliasing, to cope with different scene variations. [zh_TW]
dc.description.abstract: In this thesis, we propose a deep learning solution for Optical Character Recognition (OCR). OCR can be divided into two main parts. The first part is text-line detection: we propose two text-line detection models that use very different methods. The first detection model is Water Pixel, an image segmentation-based model. Water Pixel predicts two segmentation maps: a text segmentation at the normal size and a shrunk text segmentation. We then use the watershed algorithm to solve the text overlapping problem and achieve the effect of instance segmentation. The other detection model we propose is CT-SSD. After comparing several deep learning-based algorithms, we find that CTPN (Connectionist Text Proposal Network) performs well. The main idea of CTPN is to divide the ground-truth boxes into small pieces and then train with Faster R-CNN. CTPN detects long text lines well, but its speed does not meet our requirements. Thus, we use SSD (Single-Shot multibox Detector) as the detection backbone instead of Faster R-CNN. CT-SSD not only detects long text well but also reduces inference time.
The second part is text-line recognition: there is no publicly available traditional Chinese text-line dataset, so we train our CRNN (Convolutional Recurrent Neural Network) on synthetic traditional Chinese text lines. In addition, we utilize the simplified Chinese dataset CTW (Chinese Text dataset in the Wild) to improve CRNN performance by alternately training the encoder of the CRNN. This partly mitigates the overfitting caused by training on synthetic data alone. We also adopt many augmentation techniques, such as blur, shrinking and stretching, distortion, background variation, shake, and aliasing, to avoid overfitting to a single scenario. [en]
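The Water Pixel idea described in the abstract (seeds from the shrunk segmentation, grown only inside the full-size segmentation) can be sketched without the full watershed machinery as a multi-source flood fill. This is an illustrative Python sketch under that interpretation, not the thesis's implementation; `separate_instances` and the list-of-lists grid encoding are hypothetical:

```python
from collections import deque

def separate_instances(full_mask, seed_labels):
    """Multi-source flood fill: grow each shrunk-text seed outward,
    claiming only pixels that the full-size text mask marks as text.
    Touching text lines get split because each pixel is claimed by
    whichever (disjoint) seed reaches it first."""
    h, w = len(full_mask), len(full_mask[0])
    labels = [row[:] for row in seed_labels]
    queue = deque((y, x) for y in range(h) for x in range(w) if labels[y][x] > 0)
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and full_mask[ny][nx] and labels[ny][nx] == 0:
                labels[ny][nx] = labels[y][x]  # nearest seed wins (BFS order)
                queue.append((ny, nx))
    return labels

# Two text lines that touch in the full-size mask, with disjoint seeds:
full = [[1] * 6 for _ in range(4)]        # all four rows are "text"
seeds = [[0] * 6 for _ in range(4)]
seeds[0][2] = seeds[0][3] = 1             # shrunk seed of line 1
seeds[3][2] = seeds[3][3] = 2             # shrunk seed of line 2
result = separate_instances(full, seeds)  # rows 0-1 labeled 1, rows 2-3 labeled 2
```

Because the shrunk seeds are disjoint even when the full-size regions touch, the grown regions carry distinct instance labels, which is the effect the watershed step provides.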
dc.description.provenance: Made available in DSpace on 2021-06-07T17:32:55Z (GMT). No. of bitstreams: 1. U0001-0307202015335300.pdf: 5223232 bytes, checksum: e59b1c89fb6fd971a1453f57248f0ae8 (MD5). Previous issue date: 2020 [en]
dc.description.tableofcontents:
Acknowledgements i
Chinese Abstract ii
ABSTRACT iv
CONTENTS vi
LIST OF FIGURES ix
LIST OF TABLES xiii
Chapter 1 Introduction 1
1.1 Overview 1
1.2 Workflow 3
1.3 Thesis Organization 3
Chapter 2 Text Line Detection Background 4
2.1 Faster R-CNN [8] 4
2.2 Single-Shot multibox Detector (SSD) 5
2.3 Connectionist Text Proposal Network (CTPN) 6
2.4 An Efficient and Accurate Scene Text Detector (EAST) 6
Chapter 3 Text Line Detection Model 1: Water Pixel 8
3.1 Introduction 8
3.2 Model Architecture 8
3.3 Training Objective 10
3.4 Workflow 10
3.5 Training Data 14
3.6 Experimental Result 15
Chapter 4 Text Line Detection Model 2: CT-SSD 20
4.1 Introduction 20
4.2 Our Model: CT-SSD 20
4.3 Training Objective 21
4.4 Non-Maximum Suppression 23
4.5 Text-Line Construction Algorithm 24
4.6 Training Data Visualization 29
4.7 Experimental Results 32
Chapter 5 Text Detection Experimental Results 37
Chapter 6 Text-Line Recognition 38
6.1 Task Introduction 38
6.2 Text-Line Recognition Model Architecture 38
6.3 Label Sequence Prediction 41
6.4 Training Data for CRNN 43
6.5 Training Data for Auxiliary Part 45
6.6 Training Details 46
6.7 Evaluation Criterion 47
6.8 Test Result 48
Chapter 7 Specification 51
Chapter 8 Conclusion and Future Works 52
References 53
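Section 6.3 (Label Sequence Prediction) relies on CTC [20] to turn the CRNN's frame-wise outputs into a character sequence. As a hedged illustration (the thesis does not publish its decoder; `ctc_greedy_decode` is a hypothetical name), best-path CTC decoding takes the per-frame argmax, merges consecutive repeats, and drops the blank label:

```python
def ctc_greedy_decode(frames, blank=0):
    """Best-path CTC decoding: argmax label per frame, then
    collapse consecutive repeats, then remove blanks."""
    path = [max(range(len(f)), key=f.__getitem__) for f in frames]
    decoded, prev = [], blank
    for label in path:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded

# Frame-wise scores over {blank=0, 'A'=1, 'B'=2}; argmax path is 1,1,0,1,2,2
frames = [[0.1, 0.8, 0.1], [0.1, 0.8, 0.1], [0.9, 0.05, 0.05],
          [0.1, 0.7, 0.2], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7]]
print(ctc_greedy_decode(frames))  # [1, 1, 2], i.e. "AAB"
```

The blank between the two 1s is what lets CTC emit a genuinely repeated character, "AA", rather than collapsing it to a single "A".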
dc.language.iso: en
dc.subject: Single-Shot multibox Detector [zh_TW]
dc.subject: Convolutional Recurrent Neural Network [zh_TW]
dc.subject: deep learning [zh_TW]
dc.subject: instance segmentation [zh_TW]
dc.subject: Connectionist Text Proposal Network [zh_TW]
dc.subject: Instance Segmentation [en]
dc.subject: Deep learning [en]
dc.subject: CRNN [en]
dc.subject: SSD [en]
dc.subject: CTPN [en]
dc.title: 使用深度學習之標籤光學文字識別 (Label Optical Character Recognition with Deep Learning) [zh_TW]
dc.title: Label Optical Character Recognition with Deep Learning [en]
dc.type: Thesis
dc.date.schoolyear: 108-2
dc.description.degree: Master (碩士)
dc.contributor.oralexamcommittee: 沈裕池 (Yu-Chi Shen), 賴志宏 (Zhi-Hong Lai), 沈立健 (Li-Jian Shen)
dc.subject.keyword: deep learning, instance segmentation, Connectionist Text Proposal Network, Single-Shot multibox Detector, Convolutional Recurrent Neural Network [zh_TW]
dc.subject.keyword: Deep learning, Instance Segmentation, CTPN, SSD, CRNN [en]
dc.relation.page: 56
dc.identifier.doi: 10.6342/NTU202001296
dc.rights.note: Not authorized for public access (未授權)
dc.date.accepted: 2020-07-15
dc.contributor.author-college: College of Electrical Engineering and Computer Science [zh_TW]
dc.contributor.author-dept: Graduate Institute of Computer Science and Information Engineering [zh_TW]
Appears in collections: Department of Computer Science and Information Engineering

Files in this item:
U0001-0307202015335300.pdf (5.1 MB, Adobe PDF): not authorized for public access

