Skip navigation

DSpace

機構典藏 DSpace 系統致力於保存各式數位資料(如:文字、圖片、PDF)並使其易於取用。

點此認識 DSpace
DSpace logo
English
中文
  • 瀏覽論文
    • 校院系所
    • 出版年
    • 作者
    • 標題
    • 關鍵字
    • 指導教授
  • 搜尋 TDR
  • 授權 Q&A
    • 我的頁面
    • 接受 E-mail 通知
    • 編輯個人資料
  1. NTU Theses and Dissertations Repository
  2. 電機資訊學院
  3. 資訊網路與多媒體研究所
請用此 Handle URI 來引用此文件: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/54106
完整後設資料紀錄
DC 欄位值語言
dc.contributor.advisor陳偉松(Tony Tan)
dc.contributor.authorShih-Wei Huangen
dc.contributor.author黃世瑋zh_TW
dc.date.accessioned2021-06-16T02:40:15Z-
dc.date.available2020-08-21
dc.date.copyright2020-08-21
dc.date.issued2020
dc.date.submitted2020-08-13
dc.identifier.citation[1] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. In Erich M. Nahum and Dongyan Xu, editors, 2nd USENIX Workshop on Hot Topics in Cloud Computing, HotCloud’10, Boston, MA, USA, June 22, 2010. USENIX Association, 2010.
[2] Michael Armbrust, Tathagata Das, Aaron Davidson, Ali Ghodsi, Andrew Or, Josh Rosen, Ion Stoica, Patrick Wendell, Reynold Xin, and Matei Zaharia. Scaling spark in the real world: Performance and usability. Proc. VLDB Endow., 8(12):1840–1843, 2015.
[3] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.
[4] Spark joins. https://www.linkedin.com/pulse/spark-sql-3-common-joins-explained- ram-ghadiyaram.
[5] Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. Spark SQL: relational data processing in spark. In Timos K. Sellis, Susan B. Davidson, and Zachary G. Ives, editors, Proceedings of the 2015 ACM SIGMOD In- ternational Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015, pages 1383–1394. ACM, 2015.
[6] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Mur- phy McCauly, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Steven D.Gribble and Dina Katabi, editors, Proceedings of the 9th USENIX Symposium on Net- worked Systems Design and Implementation, NSDI 2012, San Jose, CA, USA, April 25-27, 2012, pages 15–28. USENIX Association, 2012.
[7] ClaudeBarthels,GustavoAlonso,TorstenHoefler,TimoSchneider,andIngoMüller. Distributed join algorithms on thousands of cores. Proc. VLDB Endow., 10(5):517– 528, 2017.
[8] Jonny Daenen, Frank Neven, Tony Tan, and Stijn Vansummeren. Parallel evaluation of multi-semi-joins. CoRR, abs/1605.05219, 2016.
dc.identifier.urihttp://tdr.lib.ntu.edu.tw/jspui/handle/123456789/54106-
dc.description.abstractApache Spark在叢集系統中提供許多高速運算的模組,其中Spark- SQL負責分散式資料庫高效率的查詢等演算法。
在分散式資料庫,大多數程序都牽涉到分散式系統中不同節點間資料的交換這個成本高、耗時長的過程。洗牌雜湊加入查詢是一個評估加入查詢的有名演算法,但我們發現他在節點間造成不必要的資料交換,且有機會發生計算負擔不平衡的狀況。
我們提出一個洗牌雜湊加入查詢的優化版本來評估半加入查詢,其名為RDTS(Reducing Data Transfer for Semijoin)。他不只減少了節點間不必要的資料交換,也確保了各節點的計算負擔平衡。
我們用Scala這個語言在Spark上實作RDTS,且比較其與原本的差異。此外,我們的演算法能夠輕易的延伸以評估複數半加入查詢。
zh_TW
dc.description.abstractApache Spark provides several modules for fast computation in cluster system. Spark SQL is one of its modules dedicated to efficient SQL query evaluation in distributed database. Most processes in distributed database require data exchange between multiple nodes in distributed system, which can be costly and time consuming.
Shuffle hash join is one well known algorithm for evaluating join in dis- tributed database system. We discover that it incurs unnecessary data ex- change and may result in load imbalance between the nodes. We propose an algorithm for semijoin/antijoin evaluation, which is an improved version of shuffle hash join, and we call it RDTS (Reducing Data Transfer for Semi- join). It not only reduces the amount of data exchange between nodes, but also guarantees load balance among the nodes.
We implement RDTS in Spark using the language Scala and compare the difference between our algorithm and shuffle hash join. Our algorithm can be easily extended for multiple semijoin/antijoin evaluation.
en
dc.description.provenanceMade available in DSpace on 2021-06-16T02:40:15Z (GMT). No. of bitstreams: 1
U0001-0308202023562800.pdf: 14629940 bytes, checksum: 5cbdc175c9af92a469727d935b871539 (MD5)
Previous issue date: 2020
en
dc.description.tableofcontents口試委員會審定書 iii
誌謝 v
Acknowledgements vii
摘要 ix
Abstract xi
1 Introduction 1
1.1 Background................................. 1
1.2 Motivation.................................. 2
1.3 Summary of contributions ......................... 3
1.4 Thesis outline................................ 4
2 Relational Algebra and SQL 5
2.1 Relational algebra .............................. 5
2.2 SQL....... .............................. 7
3 Spark and Spark SQL 11
3.1 Introduction to Apache Spark........................ 11
3.2 RDD..................................... 13
3.2.1 Functions of RDD ......................... 13
3.2.2 Shuffle operations ......................... 15
3.3 Dataframe.................................. 16
3.3.1 Functions of Dataframes ...................... 16
3.4 Join implementations in Spark SQL .................... 17
4 Multiple Semijoin Queries 21
4.1 Problem formulations............................ 21
4.2 The algorithms ............................... 23
4.2.1 The algorithms for a single query ................. 23
4.2.2 The algorithms for MSQ queries.................. 25
5 Experiments 27
5.1 Experimental environment ......................... 27
5.2 Input data.................................. 28
5.3 Experiment result.............................. 30
5.3.1 Execution time and memory usage................. 30
6 Conclusion 39
A Experiment result tables 41
Bibliography 59
dc.language.isoen
dc.subject分散式資料庫zh_TW
dc.subjectApache Sparkzh_TW
dc.subjectSpark SQLzh_TW
dc.subject半加入查詢zh_TW
dc.subject分散式資料庫zh_TW
dc.subjectApache Sparkzh_TW
dc.subjectSpark SQLzh_TW
dc.subject半加入查詢zh_TW
dc.subjectApache Sparken
dc.subjectdistributed databaseen
dc.subjectApache Sparken
dc.subjectSpark SQLen
dc.subjectsemijoin evaluationen
dc.subjectdistributed databaseen
dc.subjectSpark SQLen
dc.subjectsemijoin evaluationen
dc.titleSpark洗牌雜湊加入查詢的優化zh_TW
dc.titleImproved Shuffle Hash Join for Sparken
dc.typeThesis
dc.date.schoolyear108-2
dc.description.degree碩士
dc.contributor.oralexamcommittee林忠緯(Chung-Wei Lin),陳郁方(Yu-Fang Chen)
dc.subject.keyword分散式資料庫,Apache Spark,Spark SQL,半加入查詢,zh_TW
dc.subject.keyworddistributed database,Apache Spark,Spark SQL,semijoin evaluation,en
dc.relation.page60
dc.identifier.doi10.6342/NTU202002333
dc.rights.note有償授權
dc.date.accepted2020-08-14
dc.contributor.author-college電機資訊學院zh_TW
dc.contributor.author-dept資訊網路與多媒體研究所zh_TW
顯示於系所單位:資訊網路與多媒體研究所

文件中的檔案:
檔案 大小格式 
U0001-0308202023562800.pdf
  未授權公開取用
14.29 MBAdobe PDF
顯示文件簡單紀錄


系統中的文件,除了特別指名其著作權條款之外,均受到著作權保護,並且保留所有的權利。

社群連結
聯絡資訊
10617臺北市大安區羅斯福路四段1號
No.1 Sec.4, Roosevelt Rd., Taipei, Taiwan, R.O.C. 106
Tel: (02)33662353
Email: ntuetds@ntu.edu.tw
意見箱
相關連結
館藏目錄
國內圖書館整合查詢 MetaCat
臺大學術典藏 NTU Scholars
臺大圖書館數位典藏館
本站聲明
© NTU Library All Rights Reserved