Enhancing Effectiveness of Internet Search and Advertising with Web Log Mining

Chieh-Jen Wang; 王界人

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/62522

Title:	Enhancing Effectiveness of Internet Search and Advertising with Web Log Mining (運用網站日誌探勘提昇網際網路搜尋與廣告之效能)
Author:	Chieh-Jen Wang (王界人)
Advisor:	Hsin-Hsi Chen (陳信希)
Keywords:	Web log mining, Information retrieval, Web search enhancement, Internet advertising, Search results diversification
Year of publication:	2013
Degree:	Doctoral
Abstract:	Web log mining has recently attracted considerable attention in many research fields, especially Internet search and advertising. User experience and knowledge mined from web log data can improve the effectiveness or efficiency of Internet search and advertising. In this dissertation, we propose several models that mine such experience and knowledge from different types of web log data, and we apply them to four tasks: search results diversification, complex search task support, ad click prediction, and cost-per-click (CPC) prediction for ad words.
First, we apply web log mining techniques to discover the subtopics of a query for search results diversification. Web queries are often ambiguous and can be phrased in many different ways, so a single query frequently has more than one interpretation. Diversifying the ranked results to cover users' different information needs has therefore become an important issue. We mine the subtopics of a query either directly from the query itself or indirectly from the relevant documents returned by retrieval systems. After subtopic mining, we propose a subtopic-based diversified retrieval model that re-ranks a list of documents with respect to the mined subtopics, balancing relevance and diversity. In our experiments, the model not only improves ranking quality but also broadens the subtopic coverage of the retrieved search results.
Second, we automatically mine sequences of required actions from web log data to support complex search tasks, which contain more than one subtask and cannot be satisfied with a single query. Learning users' behaviors and mining the required actions first requires identifying the boundaries of a user's search intent within sessions. We propose three approaches to identify these boundaries, each based on a different concept: a time-based, a query-based, and an intent-cluster-based approach. We select sessions from a large-scale real-world web log dataset, extract features from the selected sessions, and cluster sessions of similar intent; the resulting intent clusters are used to identify users' search intent boundaries. After intent boundary identification, we propose a novel algorithm that mines sequences of actions, called search scripts, from the intent clusters. Drawing on the collective wisdom of web users, a search script supplies a sequence of actions that guides users through a complex search task, improving the search effectiveness of retrieval systems.
Third, we use web log mining to enhance the effectiveness of Internet advertising. We propose several mechanisms that help both ad network operators (i.e., search engines) and advertisers improve ad performance and increase revenue. For the ad network operator, we propose an intent-based system for ad click prediction; it learns users' click behavior from large-scale ad search and click logs and predicts potential ad clicks. For the advertiser, the CPC offered for ad words is a major factor in an ad's rank on search result pages, so selecting an appropriate CPC is crucial. We extract features at different semantic levels from a large-scale real-world ad words dataset and explore various machine learning algorithms to predict an appropriate CPC for advertisers.
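The time-based approach to intent boundary identification named in the abstract can be illustrated with a minimal sketch. The abstract gives neither the actual rule nor a threshold, so the 30-minute gap, the `(timestamp, query)` record layout, and the function name `split_by_time_gap` are all assumptions made for illustration, not the dissertation's method.

```python
from datetime import datetime, timedelta


def split_by_time_gap(records, max_gap=timedelta(minutes=30)):
    """Split one user's (timestamp, query) records into segments.

    A new segment starts whenever the gap to the previous query exceeds
    max_gap. The 30-minute default is an illustrative assumption.
    """
    segments = []
    previous_time = None
    # Process queries in chronological order.
    for timestamp, query in sorted(records, key=lambda r: r[0]):
        if previous_time is None or timestamp - previous_time > max_gap:
            segments.append([])  # boundary detected: open a new segment
        segments[-1].append(query)
        previous_time = timestamp
    return segments


# Hypothetical log entries for a single user.
log = [
    (datetime(2013, 1, 1, 9, 0), "flights to tokyo"),
    (datetime(2013, 1, 1, 9, 5), "tokyo hotels"),
    (datetime(2013, 1, 1, 14, 0), "python csv module"),
]
segments = split_by_time_gap(log)
```

With this sample log, the first two queries fall into one segment and the third, issued hours later, opens a new one. A fixed gap like this only looks at timing, whereas the query-based and intent-cluster-based approaches in the abstract rely on other signals.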
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/62522
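Full-text authorization:	Paid authorization
Appears in department:	Department of Computer Science and Information Engineering (資訊工程學系)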

Files in this item:

File	Size	Format
ntu-102-1.pdf (not currently authorized for public access)	1.73 MB	Adobe PDF

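Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.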
