Data Modeling in Cloud with Cassandra: From SQL to NoSQL

Yi-Hsiung Chen; 陳義雄

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/65151

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	郭斯彥(Sy-Yen Kuo)
dc.contributor.author	Yi-Hsiung Chen	en
dc.contributor.author	陳義雄	zh_TW
dc.date.accessioned	2021-06-16T23:27:16Z	-
dc.date.available	2017-08-09
dc.date.copyright	2012-08-09
dc.date.issued	2012
dc.date.submitted	2012-07-31
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/65151	-
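dc.description.abstract	With the rapid development of cloud computing and the rise of social networking sites such as Facebook and Twitter, more and more data is stored in the cloud. Data storage and management have traditionally been handled by relational databases such as MySQL. When a server's resources can no longer cope with the volume of data, the usual remedy is vertical scaling: upgrading the server's computing power or adding disk capacity. The main drawback of vertical scaling is cost; in the cloud era data grows so quickly that a server may need another upgrade soon after the last one. Horizontal scaling is the better approach: servers are added to the computing cluster instead of upgrading a single machine. Unfortunately, the data model of traditional relational databases limits their support for horizontal scaling, and non-relational databases emerged in response. Non-relational databases such as Cassandra are characterized by a distributed architecture and a flexible data model, and therefore typically offer high availability, high scalability, high performance, and no single point of failure. A growing number of enterprises are considering converting their traditional databases to non-relational ones, but the conversion is not easy. The first problem is rebuilding the data model. Relational design usually starts from the entities and the relations among them, whereas in the non-relational world one should first decide which queries the system must support and then design the data model to optimize query speed. The second problem is data migration. An enterprise has usually accumulated tens of thousands of records before migrating, and how to move that data into the new database is a question well worth studying. Research on non-relational databases is still insufficient and literature is scarce, which raises the difficulty of implementation. Taking a real industry case as its starting point, this thesis examines both problems in detail and, for migrating data from a MySQL database to a Cassandra database, supports the theory with an implementation and a performance evaluation, in the hope of serving as a reference for future researchers of non-relational databases.	zh_TW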
dc.description.provenance	Made available in DSpace on 2021-06-16T23:27:16Z (GMT). No. of bitstreams: 1 ntu-101-R99921068-1.pdf: 3618394 bytes, checksum: c76e63d9beb6269e778e87a4789b5300 (MD5) Previous issue date: 2012	en
dc.description.tableofcontents	Acknowledgements; Chinese Abstract; Abstract; 1 Introduction; 1.1 Background; 1.2 Motivation; 1.3 Contribution; 1.4 Organization; 2 Literature Review; 2.1 Cloud Computing; 2.1.1 The NIST definition; 2.1.2 Data Management in Cloud; 2.2 The CAP Theorem; 2.2.1 Consistency; 2.2.2 Availability; 2.2.3 Partition Tolerance; 2.3 NoSQL; 2.3.1 ACID vs. BASE; 2.3.2 CAP Spectrum; 3 Technologies; 3.1 ITRI Cloud OS; 3.2 MySQL Cluster; 3.2.1 Architecture; 3.3 Apache Cassandra; 3.3.1 Data Model; 3.3.2 Query Model; 3.3.3 Distribution, Replication and Fault Tolerance; 4 Data Modeling; 4.1 Design Differences Between RDBMS and Cassandra; 4.1.1 Query Language; 4.1.2 Referential Integrity; 4.1.3 Denormalization; 4.2 Methodology; 4.3 Design Patterns; 4.3.1 Materialized Views; 4.3.2 Secondary Indexes; 4.3.3 Valueless Column; 4.3.4 Aggregate Keys; 4.3.5 Semantic Key; 5 Case Study: ITRI Cloud OS; 5.1 Problem Description; 5.2 Methodology; 5.3 Implementation Details; 5.3.1 Server Configuration; 5.3.2 Access Methods; 5.3.3 Data Migration; 5.4 Analysis; 5.4.1 Benchmarking; 5.4.2 Result; 6 Conclusions and Future Work; 6.1 Conclusions; 6.2 Future Work; A Sample Code; A.1 Hector API Usages; A.1.1 Retrieve; A.1.2 Update; A.1.3 Delete; A.2 MapReduce Code for Data Migration from HDFS to Cassandra; Bibliography
dc.language.iso	en
dc.title	基於Cassandra資料庫之雲端資料建模：從SQL到NoSQL	zh_TW
dc.title	Data Modeling in Cloud with Cassandra: From SQL to NoSQL	en
dc.type	Thesis
dc.date.schoolyear	100-2
dc.description.degree	Master's
dc.contributor.oralexamcommittee	雷欽隆(Chin-Lung Lei),顏嗣鈞(Hsu-Chun Yen),陳俊良(Chun-Liang Chen),陳英一(Ying-Yi Chen)
dc.subject.keyword	Non-relational Database, Distributed Database, Cloud Data Processing, Data Modeling	zh_TW
dc.subject.keyword	NoSQL, Cloud Data Management, Apache Cassandra, Non-relational Database, Distributed Database	en
dc.relation.page	56
dc.rights.note	Paid license
dc.date.accepted	2012-07-31
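dc.contributor.author-college	College of Electrical Engineering and Computer Science	zh_TW
dc.contributor.author-dept	Graduate Institute of Electrical Engineering	zh_TW
Appears in Collections:	Department of Electrical Engineering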

Files in this item:

File	Size	Format
ntu-101-1.pdf (not currently authorized for public access)	3.53 MB	Adobe PDF


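Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.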