Please use this identifier to cite or link to this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/54106
Title: Improved Shuffle Hash Join for Spark (Spark洗牌雜湊加入查詢的優化)
Authors: Shih-Wei Huang (黃世瑋)
Advisor: Tony Tan (陳偉松)
Keyword: distributed database, Apache Spark, Spark SQL, semijoin evaluation
Publication Year : 2020
Degree: Master's
Abstract: Apache Spark provides several modules for fast computation on cluster systems. Spark SQL is the module dedicated to efficient SQL query evaluation in distributed databases. Most operations in a distributed database require exchanging data between nodes, which is costly and time-consuming.
Shuffle hash join is a well-known algorithm for evaluating joins in distributed database systems. We find that it incurs unnecessary data exchange and may result in load imbalance between the nodes. We propose RDTS (Reducing Data Transfer for Semijoin), an improved version of shuffle hash join for semijoin/antijoin evaluation. It not only reduces the amount of data exchanged between nodes but also guarantees load balance among them.
We implement RDTS in Spark using Scala and compare it with the original shuffle hash join. Our algorithm can be easily extended to evaluate multiple semijoins/antijoins.
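For readers unfamiliar with the baseline, the following is a minimal sketch of how a shuffle hash join can evaluate a semijoin or antijoin, and why it can move more data than necessary or load nodes unevenly. It is a plain-Python simulation under stated assumptions: it is neither Spark's implementation nor the RDTS algorithm, whose details are not given in this record. The function names (`hash_partition`, `shuffle_hash_semijoin`, `partition_sizes`) and the cost model, in which every row of both inputs counts as one transferred row, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def hash_partition(rows, key_index, num_nodes):
    """Shuffle step: route each row to the node chosen by hashing its join key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[key_index]) % num_nodes].append(row)
    return partitions


def shuffle_hash_semijoin(left, right, left_key, right_key, num_nodes, anti=False):
    """Return (result rows, number of rows sent through the shuffle).

    Semijoin keeps the left rows whose key appears in right; antijoin
    (anti=True) keeps the left rows whose key does not appear in right.
    """
    left_parts = hash_partition(left, left_key, num_nodes)
    right_parts = hash_partition(right, right_key, num_nodes)
    # Assumption of this simplified model: every row of both inputs
    # crosses the network once.
    shuffled = len(left) + len(right)

    result = []
    for node in range(num_nodes):
        # Build side: hash set of the right-hand keys that landed on this node.
        build = {row[right_key] for row in right_parts.get(node, [])}
        # Probe side: test each local left row against the hash set.
        for row in left_parts.get(node, []):
            if (row[left_key] in build) != anti:
                result.append(row)
    return result, shuffled


def partition_sizes(rows, key_index, num_nodes):
    """Rows received per node; a heavily repeated key overloads one node."""
    parts = hash_partition(rows, key_index, num_nodes)
    return [len(parts.get(node, [])) for node in range(num_nodes)]


if __name__ == "__main__":
    left = [(1, "a"), (2, "b"), (2, "c"), (5, "d")]
    right = [(2, "x"), (3, "y"), (2, "z")]
    print(shuffle_hash_semijoin(left, right, 0, 0, num_nodes=4))
    print(shuffle_hash_semijoin(left, right, 0, 0, num_nodes=4, anti=True))
    skewed = [(7, i) for i in range(9)] + [(1, 0)]
    print(partition_sizes(skewed, 0, num_nodes=4))
```

In this toy run the right-hand input contains key 2 twice and key 3 matches nothing on the left, yet all three right-hand rows are shuffled even though a semijoin only needs to know whether a key is present. The skewed input likewise sends nine of its ten rows to a single node. These are the two kinds of cost the abstract refers to.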
URI: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/54106
DOI: 10.6342/NTU202002333
Fulltext Rights: Paid license
Appears in Collections: Graduate Institute of Networking and Multimedia

Files in This Item:
File                         Size       Format
U0001-0308202023562800.pdf   14.29 MB   Adobe PDF   (Restricted Access)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
