Assisting Data Mining of Court Judgments with Natural Language Processing

Kuan-Lin Chen; 陳冠霖

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/86453

Title:	Assisting Data Mining of Court Judgments with Natural Language Processing (以自然語言處理輔助法律判決書之探勘)
Author:	Kuan-Lin Chen (陳冠霖)
Advisor:	Jieh Hsiang (項潔)
Keywords:	Court Judgment, Text Mining, Natural Language Processing, Agglomerative Hierarchical Clustering, BERT, SimHash
Year of publication:	2022
Degree:	Master's
Abstract:	Taiwan's legal judgment database records about 15.8 million judgments from 1996 (ROC year 85) to the present, roughly 120 GB in size at the time of this study's experiments. A collection this large is hard to study with traditional methods, which has led many researchers to turn to information technology for support. This thesis follows that direction: while a reader is viewing a judgment, the system offers anchors based on the reader's points of interest, and the reader can follow them to explore new, related judgments. The core of the work is designing mechanisms that quantify "exploration suitability". The relationship between two judgments may be direct and explicit, as when one judgment cites another; in that case, besides finding the cited judgment, the experiments also try to compute an additional "recommendation value" for the user. The relationship may also be one that cannot be read directly from the text, as when two judgments share the same participants or have highly similar causes of action. With designs of this kind, and under several hypothetical scenarios, the user is helped to look beyond the single judgment at hand and observe the context behind it.
The study uses the legal judgment database of Blue Planet Inc. and aims to build, from an information science perspective, mechanisms for exploring the relationships among court judgments. However, no existing indicator represents "exploration suitability", and labeling data for the task is very difficult given the volume of documents, the specialized nature of legal wording, and the expertise that labeling requires. The research challenge is therefore to use only the existing structure and text of the judgments, combining BERT, vector similarity, SimHash, and other natural language processing methods, to design exploration mechanisms in the following three aspects.
First, the people involved in a case are the most intuitive link between judgments. After isolating, in the unlabeled text, the blocks where party-role information appears, the roles from all judgments are pooled and analyzed to obtain the set of all roles ever recorded in the judgments. This set is then used to filter out, from each judgment's text, the participants and their corresponding roles, so that the user can use a person as an anchor to find that person's other judgments.
Second, the court assigns each judgment a "cause of action" (案由), a summary determined by the laws involved and the nature of the event, so judgments with similar causes of action can be regarded as partly of the same kind. On this premise, unsupervised clustering is applied to the set of causes of action to find clusters of similar ones, which the user can use as keywords to find judgments of the same kind. A separate function addresses another scenario: when the user knows nothing about the judgment or is unsure of the correct wording, any input sentence returns the closest set of existing causes of action, giving the user the correct cause of action with which to find the needed judgments.
Third, the text of a judgment mentions other judgments, which may be earlier instances of the same case or precedents. Beyond a text-filtering mechanism that collects the case numbers mentioned in the text, a recommendation score is designed from the judgment's wording structure combined with the cause-of-action similarity above, and it is used to rank the cited judgments and give the user a basis for further comparison. In addition, citation counts are computed over all historical judgments in the database as another figure that helps the user assess a judgment.
Finally, the preceding work is combined into an application, a judgment citation relationship path map, which surfaces the relationship paths hidden in, and behind, the judgment text and lets the user quickly survey the chain of citations behind a judgment. The experiments are research-oriented overall and focus on the core technical points of each aspect, but each also comes with a simple, quick application, in the hope of forming a foundation for applications that help users explore legal judgments.
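As a rough illustration of the third aspect (collecting the case numbers a judgment mentions and counting citations across the database), a minimal Python sketch follows. It is not the thesis's implementation. The regular expression assumes case numbers written with Arabic digits in the shape `107年度台上字第1234號`, and the names `CASE_NO`, `extract_citations`, and `count_citations` are hypothetical.

```python
import re
from collections import Counter

# Assumed case-number shape: <ROC year>年度<case-type word>字第<serial>號,
# for example "107年度台上字第1234號". Real judgments contain more variants
# (full-width or Chinese numerals, court-name prefixes) that this misses.
CASE_NO = re.compile(r"(\d{2,3})年度([\u4e00-\u9fff]{1,10}?)字第(\d+)號")


def extract_citations(text):
    """Return the distinct case numbers mentioned in one judgment text,
    in order of first appearance."""
    seen = []
    for match in CASE_NO.finditer(text):
        case_no = match.group(0)
        if case_no not in seen:
            seen.append(case_no)
    return seen


def count_citations(judgments):
    """Given a {case_no: text} mapping, count how many judgments
    mention each case number."""
    counts = Counter()
    for own_no, text in judgments.items():
        for cited in extract_citations(text):
            if cited != own_no:  # skip a judgment naming itself
                counts[cited] += 1
    return counts
```

Calling `count_citations` on a mapping from case number to judgment text returns a `Counter`, so `counts.most_common(10)` would list the ten most frequently cited case numbers. Each citing judgment is counted once per cited case, however many times it repeats the number; whether the thesis counts this way is not stated in the abstract.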
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/86453
DOI:	10.6342/NTU202200925
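Full-text license:	Authorized (open access worldwide)
Full-text release date:	2022-08-22
Appears in department:	Graduate Institute of Networking and Multimedia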

Files in this item:

File	Size	Format
U0001-1206202222091900.pdf	5.13 MB	Adobe PDF

Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise specified.
