Assisting Data Mining of Court Judgments with Natural Language Processing

Kuan-Lin Chen; 陳冠霖

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/86453

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	項潔(Jieh Hsiang)
dc.contributor.author	Kuan-Lin Chen	en
dc.contributor.author	陳冠霖	zh_TW
dc.date.accessioned	2023-03-19T23:56:44Z	-
dc.date.copyright	2022-08-22
dc.date.issued	2022
dc.date.submitted	2022-08-18
dc.identifier.citation	[1] Judicial Yuan, Republic of China (中華民國司法院). Sentencing Trend Recommendation System (量刑趨勢建議系統), 2018. [2] 黃志揚. Company closure risk assessment based on court judgments (基於法律判決書之公司倒閉風險評估), 2021. [3] Andreas Hotho, Andreas Nürnberger, and Gerhard Paaß. A brief survey of text mining. Citeseer, 2005. [4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. [5] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011. [6] Scikit-Learn. Scikit-learn/agglomerative.py at main · scikit-learn/scikit-learn, Feb 2022. [7] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008. [8] Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma. Detecting near-duplicates for web crawling. In Proceedings of the 16th International Conference on World Wide Web, pages 141–150, 2007. [9] Fxsjy. Fxsjy/jieba: 結巴中文分詞 (Jieba Chinese word segmentation), Feb 2020. [10] APCLab. Apclab/jieba-tw: 結巴中文斷詞台灣繁體版本 (Jieba Chinese word segmentation, Taiwan Traditional Chinese edition), Nov 2017. [11] Judicial Yuan, Republic of China (中華民國司法院). Judicial Yuan legal information retrieval system (司法院法學資料檢索系統), 2019. [12] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv preprint arXiv:1908.10084, 2019. [13] Ckiplab. Ckiplab/ckip-transformers: Ckip transformers, May 2022. [14] Giancarlo Perrone, Jose Unpingco, and Haw-minn Lu. Network visualizations with pyvis and visjs. arXiv preprint arXiv:2006.04951, 2020.
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/86453	-
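dc.description.abstract	The court judgment database records 15.8 million judgments from 1996 (ROC year 85) to date, about 120 GB in size (as of the time of this study's experiments). This volume makes traditional research methods impractical and has led many researchers to turn to IT-assisted research. This study does the same: when a reader is viewing a judgment, it tries to offer anchors based on the reader's points of interest, so that the reader can explore new judgments. The core is to design mechanisms that quantify "exploration suitability". The relation between two judgments may be direct and obvious, as when one judgment cites another; in that case, besides finding the cited judgment, the experiments also try to compute an additional "recommendation value" for the user. The relation may also be one that cannot be read directly from the text, as when two judgments share participants or have highly similar causes of action. With such designs, under several hypothetical scenarios, users are helped to look beyond the judgment in front of them and observe the context behind it. This study uses the court judgment database of 藍星球科技 (Blue Planet Inc.) and aims, from an information science perspective, to build mechanisms for mining the relational properties of court judgments. However, no existing indicator represents "exploration suitability", and adding task-specific labels is very difficult given the data volume, the specialized legal vocabulary, and the expertise that labeling requires. The challenge is therefore to design, under these conditions, mining mechanisms for the three aspects below using only the existing structure and text of the judgments, combining BERT, vector similarity, SimHash, natural language processing, and other methods. First, the people involved in a case are the most intuitive link between judgments. After isolating, in the unlabeled text, the block where role information appears, the roles from all judgments are aggregated and analyzed to obtain the set of all roles ever recorded in the judgments. This set is then used to filter out, from the text of each judgment, the participants and their corresponding roles, which users can use as anchors to find other judgments involving the same person. Second, the court determines, from the laws involved and the nature of the events, a cause of action (案由) that serves as a summary of the judgment, so judgments with similar causes of action can be regarded as partly of the same kind. On this premise, unsupervised clustering is applied to the set of causes of action to find clusters of similar causes, which users can use as keywords to find judgments of the same kind. For another hypothetical scenario, in which the user knows nothing about the judgment or cause of action or is unsure of the correct wording, any sentence can be used to find the closest set of existing causes of action, giving the user the correct cause of action with which to find the judgments needed. Third, the text of a judgment mentions other judgments, which may be lower-instance rulings or precedents. Beyond a text-filtering mechanism that collects the case numbers mentioned in the text, a recommendation score is designed from the wording structure of the judgments combined with the cause-of-action similarity above and is used to rank recommendations, giving users a further basis for comparison. In addition, citation counts are compiled over all historical judgments in the database as another figure that helps users understand a judgment. Finally, the work above is combined into an application, a judgment citation path map, which generates the relation paths hidden in, or even behind, the judgment text so that users can quickly survey the chain of document citations behind a judgment. The experiments are research-oriented overall and focus on solving the core technical problems in each aspect, but a simple, quick application is also proposed for each, in the hope of laying a foundation for applications that help users explore court judgments.	zh_TW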
dc.description.abstract	Taiwan's Legal Judgment Database records 15.8 million judgments from 1996 to date, with a size of approximately 120 GB. This volume of data is too large for traditional research methods to analyze. This thesis proposes a method that, while the user is reading a verdict, offers anchors tied to particular connections, which the user can follow to explore relevant new verdicts. The core is to design mechanisms that make the suitability of a data mining result calculable. The relationship between two verdicts can be direct and obvious, as when one verdict cites another in its text; besides finding the cited verdict, the experiment also computes an additional recommendation value for users. The relationship can also be obscure and impossible to find by searching the text, as when two verdicts have the same participants or highly similar causes of action. With designs like these, users are assisted, under several scenarios, in observing the relationships behind the verdict they are currently viewing. Using the database provided by Blue Planet Inc., we hope to construct data mining methods for court judgment relations from the perspective of information technology. However, no existing information intuitively represents the suitability of data mining results on legal judgments. Given the specificity of legal terms and the expertise required for manual labeling, additional task-specific labels are difficult to produce. The research challenge is to use the verdicts' structure and text to design data mining mechanisms for the three aspects below, combining BERT, vector similarity, SimHash, and natural language processing. First, the most straightforward aspect of judgment correlation is the people involved. The first step is to collect and analyze all judgment-related roles from the unlabeled text. Once the set of roles recorded across all judgments is known, it is used to extract the persons involved in each judgment and their corresponding roles, so that users can find other judgments involving the same persons. Second, each judgment has a corresponding cause of action, which usually outlines the laws applied and the type of event, so two verdicts with similar causes of action can be considered partially similar. Unsupervised clustering on cause-of-action similarity identifies close clusters, which serve as keywords to help users find similar judgments. A separate function helps users who are unsure of the correct wording: given any input sentence, it returns a set of close causes of action. Third, the content of a judgment cites other judgments. Besides collecting these citations, we design a ranking algorithm based on cause-of-action similarity and term structure, and we compile citation counts over all documents in the database. Finally, all of the preceding functions are combined to build a relationship path map, allowing users to quickly get an overview of the citation relationships behind a judgment. This work is research-oriented: in addition to tackling the three core technical points above, it proposes some simple and quick applications, which are expected to form a basis for applications that assist users in exploring legal judgments.	en
dc.description.provenance	Made available in DSpace on 2023-03-19T23:56:44Z (GMT). No. of bitstreams: 1 U0001-1206202222091900.pdf: 5250270 bytes, checksum: 94c4999d402be4cb6da7ae6af8aaf0c5 (MD5) Previous issue date: 2022	en
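dc.description.tableofcontents	Acknowledgements iii; Abstract (Chinese) v; Abstract vii; 1 Introduction 1; 1.1 Background and Purpose 1; 1.2 Thesis Structure 3; 2 Research Methods 5; 2.1 BERT 5; 2.2 Agglomerative Clustering 6; 2.3 Vector Cosine Similarity 9; 2.4 t-Distributed Stochastic Neighbor Embedding 10; 2.5 SimHash 11; 2.5.1 Locality-Sensitive Hashing 12; 2.5.2 Hamming Distance 12; 2.6 Jieba 13; 2.6.1 Jieba-TW 13; 3 Research Data 15; 3.1 Court Judgment Database 15; 3.2 Court Judgment Body Text 17; 4 Judgment Role and Person Extraction 19; 4.1 Role Block Separation 21; 4.2 Judgment Role Aggregation 23; 4.3 Role and Person Extraction Mechanism 27; 4.4 Extraction Examples 29; 5 Cause-of-Action Similarity 33; 5.1 Cause-of-Action Text Processing 34; 5.2 Unsupervised Clustering of Causes of Action 35; 5.3 Clustering Result Examples 42; 5.4 Nearest Cause-of-Action Set Mechanism 48; 5.5 Nearest Set Examples 50; 6 Judgment Relations 55; 6.1 Identifying Related Judgments in the Text 56; 6.2 Related Judgment Recommendation Algorithm 58; 6.3 Depth Traversal of Judgments 60; 6.4 Related Judgment Examples 62; 6.5 Judgment Relation Paths and Examples 63; 7 Discussion and Directions for Further Research 69; 7.1 Discussion of Results 69; 7.2 Research Limitations 71; 7.3 Directions for Continued Research 72; Bibliography 75; Appendices 77; .1 Table of Shortest Identifiable Court Names and Corresponding Codes 79; .2 Table of the Top 150 Citation Counts from Related Judgment Identification 84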
dc.language.iso	zh-TW
dc.subject	Text Mining (文字探勘)	zh_TW
dc.subject	Court Judgment (法律判決書)	zh_TW
dc.subject	Natural Language Processing (自然語言處理)	zh_TW
dc.subject	Agglomerative Hierarchical Clustering (階層式聚合式分群法)	zh_TW
dc.subject	BERT	zh_TW
dc.subject	SimHash	zh_TW
dc.subject	Natural Language Processing	en
dc.subject	Court Judgment	en
dc.subject	Text Mining	en
dc.subject	Agglomerative Hierarchical Clustering	en
dc.subject	BERT	en
dc.subject	SimHash	en
dc.title	Assisting Data Mining of Court Judgments with Natural Language Processing (以自然語言處理輔助法律判決書之探勘)	zh_TW
dc.title	Assisting Data Mining of Court Judgments with Natural Language Processing	en
dc.type	Thesis
dc.date.schoolyear	110-2
dc.description.degree	Master's
dc.contributor.oralexamcommittee	謝育平(Yuh-Pyng Shieh),胡其瑞(Chi-Jui Hu)
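dc.subject.keyword	Court Judgment (法律判決書), Text Mining (文字探勘), Natural Language Processing (自然語言處理), Agglomerative Hierarchical Clustering (階層式聚合式分群法), BERT, SimHash	zh_TW
dc.subject.keyword	Court Judgment, Text Mining, Natural Language Processing, Agglomerative Hierarchical Clustering, BERT, SimHash	en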
dc.relation.page	89
dc.identifier.doi	10.6342/NTU202200925
dc.rights.note	Authorization granted (open access worldwide)
dc.date.accepted	2022-08-18
dc.contributor.author-college	College of Electrical Engineering and Computer Science	zh_TW
dc.contributor.author-dept	Graduate Institute of Networking and Multimedia	zh_TW
dc.date.embargo-lift	2022-08-22	-
Appears in Collections:	Graduate Institute of Networking and Multimedia
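
The abstract names SimHash among the methods it combines, and the table of contents pairs it with Hamming distance (sections 2.5 and 2.5.2). As a rough illustration of that pairing only, and not the thesis's actual implementation, the sketch below fingerprints a text and compares two fingerprints bit by bit. Everything specific here is an assumption made for the example: MD5 as the per-token hash, 64-bit fingerprints, character bigrams in place of Jieba word segmentation, and the sample sentences, which are made up.

```python
import hashlib


def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams (assumed stand-in for Jieba tokens)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def simhash(tokens, bits=64):
    """Return a SimHash fingerprint (an int of `bits` bits) for an iterable of tokens."""
    weights = [0] * bits
    for token in tokens:
        # Hash each token and keep the first `bits` bits (MD5 is an assumption).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each set bit votes +1 for its position, each clear bit votes -1.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A position with a positive total becomes a 1 bit in the fingerprint.
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Made-up sample sentences; the first two differ by a single character.
doc_a = "被告應給付原告新臺幣伍拾萬元"
doc_b = "被告應給付原告新臺幣陸拾萬元"
doc_c = "上訴駁回，訴訟費用由上訴人負擔"

fp_a, fp_b, fp_c = (simhash(char_ngrams(d)) for d in (doc_a, doc_b, doc_c))
print("a vs b:", hamming_distance(fp_a, fp_b))
print("a vs c:", hamming_distance(fp_a, fp_c))
```

Texts that share most of their tokens tend to produce fingerprints with a small Hamming distance, which is what makes the pair usable as a cheap similarity signal over a very large collection. The distance below which two texts count as similar is a tuning choice that this sketch does not attempt to set.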

Files in This Item:

File	Size	Format
U0001-1206202222091900.pdf	5.13 MB	Adobe PDF


Items in the system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.