Mapping between Images and Texts of Ancient Books: The Case of 《古今圖書集成》 (Gujin Tushu Jicheng)

Kuan-Chung Chen; 陳冠仲

Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/4528

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	項潔
dc.contributor.author	Kuan-Chung Chen	en
dc.contributor.author	陳冠仲	zh_TW
dc.date.accessioned	2021-05-14T17:43:01Z	-
dc.date.available	2015-08-20
dc.date.available	2021-05-14T17:43:01Z	-
dc.date.copyright	2015-08-20
dc.date.issued	2015
dc.date.submitted	2015-08-12
dc.identifier.citation	[1] Wikipedia, 古今圖書集成, Available: http://zh.wikipedia.org/wiki/古今图书集成 [2] 裴芹, 〈《古今圖書集成》研究〉, 2001 [3] 古今圖書集成索引＆全書圖像, Available: http://gjtsjc.gxu.edu.cn/ [4] 數位古今圖書集成, Available: http://192.83.187.228/gjtsnet/index.htm [5] THDL-based 古今圖書集成, Available: thdl.csie.org/L303_GuJinTuShuJiCheng/ [6] 蔡孟竹, 曾元顯, 〈中文OCR文件檢索測試集之製作與應用〉, 「教育資料與圖書館學」, vol. 40, no. 3, March 2003 [7] 丁原基, 古今圖書集成, Available: http://192.83.187.228/gjtsnet/index.htm [8] 圖書集成經緯目錄, Available: http://gjtsjc.gxu.edu.cn/jwml.aspx [9] Wikipedia, Hough transform, Available: https://en.wikipedia.org/wiki/Hough_transform [10] 林易徵, 〈《古今圖書集成》自動化內容建構與出處擷取〉, Master's thesis, National Taiwan University, 2013 [11] OpenCV official website, Available: http://opencv.org/ [12] 陳夢雷 (original author), 蕭孟能 (editor and publisher), 《古今圖書集成及索引》, vol. 001, 文星書店, 1964, pp. 2-3 [13] 陳夢雷 (original author), 蕭孟能 (editor and publisher), 《古今圖書集成及索引》, vol. 001, 文星書店, 1964, p. 3 [14] 胡道靜, 〈《古今圖書集成》的情況、特點及其作用〉, 1962
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/4528	-
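dc.description.abstract	《古今圖書集成》 is the largest extant leishu (classified encyclopedia), so a number of digital humanities scholars have combined it with database systems to build full-text retrieval systems for it. Most of these systems offer search over both text and images, but their presentation of results emphasizes the text and pays little attention to the images. As a result, when users want to obtain information from an image, for example to find a keyword, they can only inspect the image by eye and get no help from the system. This study attempts to avoid relying on OCR and instead processes the images and the text directly so that the two correspond closely, then uses the text to locate characters in the images. First, all images undergo image processing, including rotation and cropping, so that every image has the same format and layout. Next, the characteristics of the images are analyzed, such as the layout of the text and the fixed size and position of illustrations, and these characteristics are used to map the state of each image onto the text line by line, so that each line of the text corresponds to one of three states in the image: text, blank line, or illustration. Finally, using the aligned text and the processed images, the position of a character in the text is computed first, and its position in the image is then found through coordinate mapping. In this way, the images of 《古今圖書集成》 no longer merely decorate the system as illustrations but can provide users with genuinely useful information.	zh_TW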
dc.description.abstract	The Complete Collection of Graphs and Writings of Ancient and Modern Times (Gujintushujicheng, or Jicheng for short), completed in the early 18th century, is the largest extant book in the world. Containing over one million Chinese characters and almost 100,000 pages, and covering over 6,000 subjects, Jicheng is also difficult to use. Over the past decade, several digital systems have been developed so that people can use Jicheng through full-text search. However, none of these systems attempts to match images and texts, which would make Jicheng even easier to use. This difficulty arises partly because OCR is still not an effective technology for old Chinese books. In this thesis we develop a method that finds a direct correspondence between an image of Jicheng and its associated text without resorting to OCR. We first calibrate the images so that all 100,000 pages in the book have the same size and format. We then analyze characteristics such as the format, number of lines, and position of graphs, so that each line in the typed text maps to a line of text, a blank line, or part of a graph in a page image. Once this is done, we perform a character-by-character mapping between each character in the typed text and a character in a page image. Our method is quite effective: the accuracy in mapping the entire content of Jicheng is 98.7%. The remaining errors are mainly due to typographic errors that occurred when typing the full text, which can easily be corrected by hand.	en
dc.description.provenance	Made available in DSpace on 2021-05-14T17:43:01Z (GMT). No. of bitstreams: 1 ntu-104-R99922126-1.pdf: 7078673 bytes, checksum: a75e079c5695ff97b18befbeba5e961e (MD5) Previous issue date: 2015	en
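dc.description.tableofcontents	Acknowledgements i; Chinese Abstract ii; ABSTRACT iii; CONTENTS iv; LIST OF FIGURES vi; LIST OF TABLES ix; Chapter 1 Introduction 1; 1.1 Research Background 1; 1.2 Research Objectives 2; 1.3 Related Work 3; 1.3.1 Guangxi University: 「古今圖書集成索引＆全書圖像」 3; 1.3.2 National Palace Museum & Soochow University: 「數位古今圖書集成」 6; 1.4 Thesis Organization 9; Chapter 2 Research Materials 10; 2.1 《古今圖書集成》 10; 2.1.1 Overview 10; 2.1.2 Editions 10; 2.1.3 Catalogue of the Jicheng 11; 2.2 Digitized Image Data of the Jicheng 12; 2.3 Digitized Text Data of the Jicheng 14; Chapter 3 Image Normalization for the Jicheng 17; 3.1 Detecting Image Skew 19; 3.2 Rotating Images 22; 3.3 Text Block Segmentation 23; Chapter 4 Processing the Jicheng Text Files 26; 4.1 Removing Redundant Information 27; 4.1.1 Removing Punctuation 27; 4.1.2 Removing Repeated Titles 27; 4.2 Preserving Information from the Original Book 28; 4.2.1 Main Text and Small Characters 28; 4.2.2 Image File Name Handling 29; 4.2.3 Rare Character Handling 29; 4.2.4 Illustration Handling 31; 4.3 Adding Blank-Line Information 34; Chapter 5 Computing Character Positions 39; Chapter 6 Conclusion and Future Work 41; REFERENCE 42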
dc.language.iso	zh-TW
dc.subject	影像處理	zh_TW
dc.subject	古今圖書集成	zh_TW
dc.subject	數位人文	zh_TW
dc.subject	Gujintushujicheng	en
dc.subject	Image Processing	en
dc.subject	Digital Humanities	en
dc.title	古籍影像與文本之對應－以《古今圖書集成》為例	zh_TW
dc.title	Mapping between Images and Texts of Completed Collection of Graphs and Writings of Ancient and Modern Times	en
dc.type	Thesis
dc.date.schoolyear	103-2
dc.description.degree	Master's
dc.contributor.oralexamcommittee	謝育平,蔡宗翰
dc.subject.keyword	古今圖書集成,數位人文,影像處理	zh_TW
dc.subject.keyword	Gujintushujicheng,Digital Humanities,Image Processing	en
dc.relation.page	42
dc.rights.note	Authorization granted (open access worldwide)
dc.date.accepted	2015-08-12
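dc.contributor.author-college	College of Electrical Engineering and Computer Science	zh_TW
dc.contributor.author-dept	Graduate Institute of Computer Science and Information Engineering	zh_TW
Appears in Collections:	Department of Computer Science and Information Engineering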

Files in this item:

File	Size	Format
ntu-104-1.pdf	6.91 MB	Adobe PDF

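Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.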