Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/71255
Title: | Guarantee and Improvement in Performance of Big Data Processing over Heterogeneous Cloud Data Centers |
Author: | Chien-Hung Chen (陳建宏) |
Advisor: | Sy-Yen Kuo (郭斯彥) |
Keywords: | Cloud Computing, Multiple Data Centers, Big Data Processing, Task Assignment, Data Skew, Job Scheduling, Heterogeneity, In-memory Techniques, Data Prefetching, Optimal Solution, Heuristic Algorithm |
Publication Year: | 2018 |
Degree: | Doctoral |
Abstract: | With the rapid growth of the Internet of Things (IoT), more and more devices transmit data to the cloud for analysis. To provide quick responses to IoT services, multiple cloud data centers are geographically distributed so that the cloud is closer to the IoT devices. A big data processing application running on the cloud therefore needs to read large amounts of data stored on remote servers across these data centers, and because its execution time is largely determined by the data access time of its tasks, long data access latency severely degrades processing performance. This dissertation investigates the performance of big data processing over heterogeneous cloud data centers and covers three main topics. First, the task assignment problem is studied with consideration of data access cost and data skew. The network topology of the cloud data centers and the data storage locations are used to formulate the data access cost of each task, and assigning each task to a server with a lower data access cost reduces the data access time of the application. However, when a large number of tasks are waiting to be assigned, selecting the lowest-cost server for every waiting task across multiple data centers takes considerable computation time. A greedy algorithm and a heuristic algorithm are therefore proposed to reduce both the total data access cost and the algorithm's computation time. Second, the scheduling of multiple big data processing jobs with deadline constraints is investigated. Existing deadline-constrained schedulers overlook two issues: computing nodes differ in performance, and task execution times may change dynamically with system load. A new scheduler based on bipartite graph modelling is proposed; it obtains the optimal solution by transforming the deadline-constrained scheduling problem into the well-known minimum weighted bipartite matching problem, and it also shortens the data access time of jobs. When the available computing resources cannot satisfy the deadlines of all jobs, it minimizes the number of jobs that violate their deadlines. Third, in-memory techniques are investigated: scheduling-aware data prefetching and eviction mechanisms are proposed that load data into memory in advance and release memory resources according to the job scheduling information, thereby reducing data access time and reclaiming precious memory. Finally, simulations and experiments on a real big data system demonstrate the feasibility and effectiveness of the proposed approaches. |
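As a rough illustration of the bipartite-matching formulation mentioned in the abstract, the minimal sketch below assigns jobs to heterogeneous nodes using SciPy's minimum weighted bipartite matching solver. The cost model (estimated finish time plus a large penalty for a missed deadline, so that deadline violations are minimized first) and all numbers are illustrative assumptions, not the dissertation's actual edge-weight formulation.

```python
# Illustrative sketch only: deadline-constrained job-to-node assignment cast as
# a minimum weighted bipartite matching problem. Estimated finish times, the
# deadlines, and the penalty scheme below are assumed for demonstration.
import numpy as np
from scipy.optimize import linear_sum_assignment

# est_finish[i][j]: estimated finish time of job i on node j, reflecting
# heterogeneous node performance (assumed values).
est_finish = np.array([
    [30.0, 45.0, 60.0],
    [50.0, 40.0, 55.0],
    [20.0, 35.0, 25.0],
])
deadlines = np.array([40.0, 45.0, 30.0])  # per-job deadlines (assumed)

# Edge weight: finish time plus a large penalty when the deadline would be
# missed, so the matching first minimizes the number of violated deadlines.
PENALTY = 1e6
cost = est_finish + PENALTY * (est_finish > deadlines[:, None])

# Minimum weighted bipartite matching (Hungarian algorithm in SciPy).
job_idx, node_idx = linear_sum_assignment(cost)
for job, node in zip(job_idx, node_idx):
    missed = est_finish[job, node] > deadlines[job]
    print(f"job {job} -> node {node}, finish {est_finish[job, node]:.0f}, "
          f"deadline {'missed' if missed else 'met'}")
```

With the assumed values every job is matched to a node that meets its deadline; the penalty term only becomes decisive when no violation-free matching exists, in which case the matching with the fewest missed deadlines is returned.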
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/71255 |
DOI: | 10.6342/NTU201801913 |
Full-text License: | Paid authorization |
Appears in Collections: | Department of Electrical Engineering |
Files in This Item:
File | Size | Format |
---|---|---|
ntu-107-1.pdf (currently not authorized for public access) | 5.06 MB | Adobe PDF |
All items in the system are protected by copyright, with all rights reserved, unless otherwise indicated.