Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/46565
Title: | Exploring Contextual and Temporal Relationships for Semantic Video Indexing |
Author: | Ming-Fang Weng (翁明昉) |
Advisor: | Yung-Yu Chuang (莊永裕) |
Keywords: | Concept-based video retrieval, Semantic video indexing, Multimedia content analysis, Contextual correlation, Temporal dependency, Cross-domain learning, TRECVID |
Publication Year: | 2010 |
Degree: | Doctoral |
Abstract: | With digital cameras and camcorders becoming ever more widespread, videos have become far easier to acquire. Meanwhile, video-sharing platforms such as YouTube have flourished, further accelerating the copying, distribution, and exchange of videos. The rapidly growing volume of large-scale video gives rise to a pressing need for query and search. In recent years, video retrieval systems have widely adopted concept-based search architectures; this technique, however, depends heavily on the accuracy of semantic video indexing. To improve indexing accuracy, this thesis investigates three key issues. First, we explore how to mine the contextual correlation and temporal dependency of semantic concepts in videos. Second, we study how to integrate multiple cues, including contextual correlation, temporal dependency, and semantic concept detectors. Finally, we examine how to mitigate the impact of video domain shift. Specifically, this thesis proposes four frameworks that mine contextual and temporal relationships from user-provided annotations, from detector-generated predictions, or from both, and use them to improve the accuracy of semantic video indexing.
Our first framework for integrating contextual correlation and temporal dependency is the rule-based post-filtering framework. In this framework, we use an association rule mining algorithm to model the contextual relationships among different semantic concepts that co-occur in the same shot, and statistical measurements to model the temporal relationships of a single semantic concept across different shots. Our experiments show that contextual relationships improve the initial labeling accuracy by about 3%, while temporal relationships improve it by about 15%. Moreover, we find that contextual correlation and temporal dependency are complementary: combining the results produced by each of them further raises the accuracy of semantic video indexing.

Second, to make fuller use of contextual and temporal information, we propose another integration framework, the multi-cue fusion framework. It differs from the former in two novel respects. First, we design a data-driven algorithm, based on probabilistic modeling and recursion, that learns multi-level contextual and temporal relationships of semantic concepts from annotated training data. Second, we use a graphical model to integrate the learned multi-level relationships and the detector predictions in a single network; by optimizing an energy function, we find the labeling that best matches the detector predictions while remaining most consistent with the contextual and temporal relationships. With this framework, most labels that the detectors predict incorrectly can be properly corrected.

In addition, to reduce the potential inconsistency of multi-level relationships caused by the difference in data source domains between training and test videos, we propose the cross-domain multi-cue fusion framework, which integrates the contextual and temporal relationships obtained from both user-provided annotations (training videos) and detector-generated predictions (test videos). By incorporating the multi-level relationships obtained from the test data, those learned from the training data can be properly adapted and applied to target videos whose domain differs from that of the original videos.

Finally, we propose an unsupervised collaborative filtering framework to remedy the inaccuracy of detector-predicted labels; its main advantage is that it is free of cross-domain problems. In this framework, we exploit two important properties, shot-to-shot similarity and concept-to-concept correlation, to explore the data dependence within the label predictions produced by semantic concept detectors. We arrange the likelihoods of all semantic concepts appearing in all shots into a matrix and, through matrix factorization, obtain a low-rank matrix that approximates the original one. Because the dependence relationships in the low-rank matrix are more consistent than those in the original matrix, the amount of incorrect data in the original matrix is greatly reduced. In other words, inaccurate detector predictions are appropriately corrected by collaborative filtering, improving overall accuracy.

To validate these four frameworks for integrating contextual correlation and temporal dependency, we conduct experiments on TRECVID data sets. The results show that the proposed methods are both efficient and effective; for most semantic concepts, labeling accuracy improves significantly. In addition, the contextual and temporal relationships built with our methods generalize to the predictions of a variety of different detectors.

In summary, the concrete contributions of this thesis are as follows. First, it provides an in-depth and complete study of the contextual correlation and temporal dependency of semantic concepts in videos. Second, it mines these relationships from various kinds of data and uses them to improve video labeling accuracy. Third, it proposes a framework that fuses multiple cues across multiple video domains and achieves the largest accuracy improvement among all published techniques. Fourth, it presents the first unsupervised framework that simultaneously exploits contextual and temporal information to improve video labeling accuracy.

The huge amount of videos currently available poses a difficult problem in semantic video retrieval. The success of query-by-concept, recently proposed to handle this problem, depends greatly on the accuracy of concept-based video indexing. This thesis studies three key issues toward improving concept detection: (1) how to explore cues beyond low-level features in an efficient and effective way, (2) how to integrate these learned high-level relations with independent concept detectors into a common framework, and (3) how to exploit the information embedded within the initial detection results to alleviate cross-domain problems.
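The temporal dependency filtering described above can be illustrated with a minimal sketch. This is an assumed neighbour-averaging scheme for exposition only, not the thesis's actual filter design; the function name, the `weight` parameter, and the example scores are all hypothetical:

```python
# Illustrative sketch (hypothetical, not the thesis implementation):
# refine one concept detector's per-shot scores using temporal neighbours,
# reflecting the observation that a concept tends to persist across
# adjacent shots in a video.

def temporal_filter(scores, weight=0.3):
    """Blend each shot's score with the average of its neighbours.

    `scores` is a list of detection probabilities for one concept over
    consecutive shots; `weight` controls how strongly neighbouring
    shots contribute. Both names are illustrative assumptions.
    """
    refined = []
    n = len(scores)
    for i, s in enumerate(scores):
        neighbours = [scores[j] for j in (i - 1, i + 1) if 0 <= j < n]
        context = sum(neighbours) / len(neighbours) if neighbours else s
        refined.append((1 - weight) * s + weight * context)
    return refined

# A run of high-scoring shots pulls up an isolated low score between them.
print(temporal_filter([0.9, 0.2, 0.9]))
```

In this toy run the middle shot's score rises because both of its neighbours score highly, which is the intuition behind exploiting inter-shot dependency.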
Specifically, we propose several frameworks to take advantage of both contextual correlation and temporal dependency from user-provided annotations and/or detector-generated predictions for various application scenarios. We first present a rule-based post-filtering framework that combines contextual correlation and temporal dependency to enhance the robustness and accuracy of semantic concept detection. Given manually annotated ground truth, we use association rule mining techniques to discover inter-concept contextual relationships and adopt a strategy to combine correlated detectors. In addition, we investigate statistical measurements to discover inter-shot temporal relationships and propose a filter design to fuse dependent detectors. Experiments on the TRECVID 2005 data set show that our framework is not only effective but also efficient. Furthermore, it can be easily integrated with existing detectors to boost their performance. To exploit the refined scores for inference, instead of using the detection scores directly, we introduce a multi-cue fusion framework that explores and unifies both contextual correlation among concepts and temporal dependency among shots. This framework is novel in two ways. First, a recursive algorithm is proposed to learn both inter-concept and inter-shot relationships from manual annotations of tens of thousands of shots with hundreds of concepts. Second, labels for all concepts and all shots in a video are solved simultaneously by optimizing a graphical model. Experiments on the TRECVID 2006 data set show that our framework is promising for semantic video indexing, achieving around a 30% performance boost over two popular baselines, VIREO-374 and Columbia374, in inferred average precision. Toward solving the problem of domain change between training and test videos, we propose a cross-domain multi-cue fusion framework that explores multiple cues across various video domains and then fuses them all together.
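The joint labeling idea in the multi-cue fusion framework can be sketched in miniature. The energy form below (a quadratic unary term tied to detector scores minus a pairwise co-occurrence reward) and the exhaustive minimization are illustrative assumptions, not the thesis's actual graphical model or optimizer:

```python
# Hedged sketch of multi-cue fusion: choose binary labels for all concepts
# at once by minimizing an energy that trades off fidelity to detector
# scores (unary term) against agreement with known concept co-occurrence
# (pairwise term). Energy form, weights, and data are illustrative only.

import itertools

def best_labels(scores, co_occur, lam=0.5):
    """Exhaustively minimize
       E(x) = sum_i (x_i - score_i)^2 - lam * sum_{i<j} co_occur[i][j] * x_i * x_j
       over binary label vectors x (feasible only for a handful of concepts)."""
    n = len(scores)

    def energy(x):
        unary = sum((xi - s) ** 2 for xi, s in zip(x, scores))
        pair = sum(co_occur[i][j] * x[i] * x[j]
                   for i in range(n) for j in range(i + 1, n))
        return unary - lam * pair

    return min(itertools.product((0, 1), repeat=n), key=energy)

# Two concepts that often co-occur (e.g. "road" and "car"): a confident
# score for one can flip a borderline score for the other to positive.
scores = [0.9, 0.45]              # detector outputs for the two concepts
co_occur = [[0, 1.0], [1.0, 0]]   # strong pairwise correlation
print(best_labels(scores, co_occur))
```

With the correlation term present, the borderline second concept is labeled positive because it agrees with the confidently detected first concept; with a zero correlation matrix it stays negative, mirroring how the framework lets contextual relations correct individual detector errors.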
In this framework, test shots are assigned pseudo-labels so that contextual and temporal relations can be modeled in an unsupervised manner. Integrating the relationships learned from user-provided annotations (training videos) and detector-generated predictions (test videos) accommodates the domain change, leading to higher labeling quality. Extensive experiments on the TRECVID 2006-2008 data sets show that our framework outperforms state-of-the-art approaches, achieving significant performance gains (ranging from 27% to 61% across settings) on a widely used benchmark. Finally, this thesis describes a collaborative filtering framework that refines the initial detection scores in a fully unsupervised fashion by exploring shot-to-shot (clip-to-clip) similarity and concept-to-concept correlation in a large collection of test videos. We treat the noisy (inaccurate) scores for all concepts and all shots as a matrix. These scores are then de-noised via matrix factorization, which discovers the data dependence within the matrix. We further improve this method by dividing the score matrix into patches; better models are learned from grouped similar patches to further enhance detection accuracy. In addition to being easy to implement, the method achieves salient improvements on the TRECVID 2006-2008 evaluation benchmarks, ranging from 20% to 50%, without using any labeled training data or external resources. The major contributions of this thesis can be summarized as follows. (1) An in-depth investigation of jointly exploiting inter-concept correlation and inter-shot dependency to enhance the detection of generic concepts. (2) The first study to explore various sources for discovering relational knowledge that benefits semantic video indexing.
(3) A state-of-the-art system that fuses multiple cues from multiple domains, yielding the highest reported performance improvement among approaches that exploit high-level relations for concept detection. (4) The first unsupervised approach that simultaneously utilizes contextual and temporal information to improve concept-based video indexing. |
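The collaborative filtering idea, treating the concept-by-shot score matrix as noisy and de-noising it with a low-rank approximation, can be sketched with a truncated SVD. The rank, the toy matrix, and the function name are assumptions for illustration, not the thesis's actual factorization:

```python
# Illustrative sketch of matrix-factorization de-noising: a low-rank
# (truncated SVD) reconstruction of the concept-by-shot score matrix
# exploits shot-to-shot similarity and concept-to-concept correlation
# to pull inconsistent entries toward the dominant pattern.

import numpy as np

def denoise_scores(score_matrix, rank=1):
    """Return the best rank-`rank` approximation of the score matrix."""
    u, s, vt = np.linalg.svd(score_matrix, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]

# Rows are concepts, columns are shots. Three similar shots carry
# consistent scores except for one noisy cell (second concept, middle shot).
scores = np.array([[0.9, 0.9, 0.9],
                   [0.8, 0.1, 0.8]])
clean = denoise_scores(scores, rank=1)
print(clean)
```

In this toy example the inconsistent 0.1 entry rises markedly in the rank-1 reconstruction because the other shots and the correlated first concept dominate the low-rank structure, which is the mechanism by which the framework corrects inaccurate detector predictions without any labeled data.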
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/46565 |
Full-Text Access: | Fee-based authorization |
Appears in Collections: | Department of Computer Science and Information Engineering |
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-99-1.pdf (currently not authorized for public access) | 2.67 MB | Adobe PDF |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.