Please use this identifier to cite or link to this item: http://hdl.handle.net/11455/90788
標題: 一種利用假鍵以改善MapReduce負載平衡的虛擬處理機分群機制
A Virtual Processor Partition Mechanism with Pseudo Key for Load Balancing in MapReduce
作者: Wei-Hsiang Chung
鍾威翔
關鍵字: MapReduce;Hadoop;Load balancing;Skew problem;Partition strategy;Pseudo key;Virtural processor partition;MapReduce;Hadoop;負載平衡;斜曲問題;分配機制;假鍵;虛擬處理機分配
引用: [1] A. Silberschatz, P. B. Galvin and G. Gagne, Operating System Concepts 9/e, WILEY, 2013. [2] T. White, Hadoop: The Definitive Guide, O' Reilly Media, Inc., 2012. [3] S. Brin and L. Page, 'The Anatomy of a Large-Scale Hypertextual Web Search Engine,' Computer networks and ISDN systems, 30(1), 1998, pages 107-117. [4] J. S. Brown and R. P. Adler, 'Open Education, The Long Tail, and Learning 2.0,' Educause review, 43(1), 2008, pages 16-20. [5] F. Chang, J. Dean, S. Ghemawat, C. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes and R. E. Gruber, 'Bigtable: A Distributed Storage System for Structured Data,' ACM Transactions on Computer Systems, 26(4), 2008, pages 1-26. [6] Q. Chen, C. Liu and Z. Xiao, 'Improving MapReduce Performance Using Smart Speculative Execution Strategy,' IEEE Transactions on Computers, 63(4), 2014, pages 954-967. [7] J. Dean and S. Ghemawat, 'MapReduce: Simplified Data Processing on Large Clusters,' Communications of the ACM, 51(1), 2008, pages 107-113. [8] D. J. DeWitt and J. Gray, 'Parallel Database Systems: The Future of High Performance Database Systems,' Communications of the ACM, 35(6), 1992, pages 85-98. [9] D. J. DeWitt, J. F. Naughton, D. A. Schneider and S. Seshadri, 'Practical Skew Handling in Parallel Joins,' Proceedings of the 18th International Conference on Very Large Data Bases, 1992., pages 27-40. [10] S. Ghemawat, H. Gobioff, and S.-T. Leung, 'The Google File System,' ACM SIGOPS Operating Systems Review, 37(5), 2003, pages 29-43. [11] B. Gufler, N. Augsten, A. Reiser and A. Kemper, 'Load Balancing in MapReduce Based on Scalable Cardinality Estimates,' Proceedings of the 2012 IEEE 28th International Conference on Data Engineering, 2012, pages 523-533. [12] S. Ibrahim, H. Jin, L. Lu, S. Wu, B. He and L. Qi, 'LEEN: Locality/Fairness- Aware Key Partitioning for MapReduce in the Cloud,' Proceedings of the 2nd IEEE International Conference on Cloud Computing Technology and Science, 2010, pages 17-24. [13] L. Kolb, A. Thor and E. Rahm, 'Block-based Load Balancing for Entity Resolution with MapReduce,' Proceedings of the 20th ACM international conference on Information and knowledge management, 2011, pages 2397-2400. [14] L. Kolb, A. Thor and E. Rahm, 'Load Balancing for MapReduce-based Entity Resolution,' Proceedings of the 2012 IEEE 28th International Conference on Data Engineering (ICDE), 2012, pages 618-629. [15] Y. Kwon, M. Balazinska, B. Howe and J. Rolia, 'Skew-resistant Parallel Processing of Feature-extracting Scientific User-defined Functions,' Proceedings of the 1st ACM symposium on Cloud computing, 2010, pages 75-86. [16] Y. Kwon, M. Balazinska, B. Howe and J. Rolia, 'A Study of Skew in MapReduce Applications,' Open Cirrus Summit, 2011. [17] Y. Kwon, M. Balazinska, B. Howe and J. Rolia, 'Skewtune: Mitigating Skew in MapReduce Applications,' Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, 2012, pages 25-36. [18] D. Li, Y. Chen and R. H. Hai, 'Skew-Aware Task Scheduling in Clouds,' Proceedings of the IEEE Seventh International Symposium on Service-Oriented System Engineering, 2013, pages 342-346. [19] V. Martha, W. Zhao and X. Xu, 'h-MapReduce: A Framework for Workload Balancing in MapReduce,' Proceedings of the IEEE 27th International Conference on Advanced Information Networking and Applications (AINA), 2013, pages 637-644. [20] S. R. Ramakrishnan, G. Swart and A. Urmanov, 'Balancing Reducer Skew in MapReduce Workloads using Progressive Sampling,' Proceedings of the Third ACM Symposium on Cloud Computing,' 2012, article 16. [21] M. C. Schatz, 'CloudBurst: Highly Sensitive Rad Mapping with MapReduce,' Bioinformatics 25(11), 2009, pages 1363-1369. [22] K. Shvachko, H. Kuang, S. Radia and R. Chansler, 'The Hadoop Distributed File System,' Proceedings of the IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 2010, pages 1-10. [23] Y. Xu, P. Zou, W. Qu, Z. Li, K. Li and X. Cui, 'Sampling-based Partitioning in MapReduce for Skewed Data,' Proceedings of the 2012 Seventh ChinaGrid Annual Conference, 2012, pages 1-8. [24] W. Yan, Y. Xue and B. Malin, 'Scalable and Robust Key Group Size Estimation for Reducer Load Balancing in MapReduce,' Proceedings of the IEEE International Conference on Big Data, 2013, pages 156-162. [25] M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica, 'Improving MapReduce Performance in Heterogeneous Environments,' Operating Systems Design and Implementation, 8(4), 2008, Pages: 29-42. [26] Apache Hadoop Project, http://hadoop.apache.org/. [27] SNAP: Stanford Network Analysis Project, http://snap.stanford.edu/.
摘要: 
The MapReduce framework is an efficient parallel computing platform. However, the performance of MapReduce applications is affected by the workload distribution of work nodes. Imbalance workload will lead to poor performance because the overloaded workers will become the bottleneck of the system. The imbalance workload is caused by two subproblems: one is the poor partition strategy, which assigns the keys simply to work nodes (called 'bad partition' problem) and the other is the 'key skew' problem, where some keys have a lager number of values than others. This research proposes a mechanism called Virtual-Processor-Partition and a strategy called Pseudo Key. They are proposed for resolving both the 'bad partition' and 'key skew' problems. The mechanism distributes the workload to a large number of virtual processors and then redistributes the balanced workload to the real workers, while the Pseudo Key strategy resolves the 'key skew' problem by dividing the 'skew keys' into several pseudo keys and assigning the pseudo keys to different Reduce tasks. The experimental results show that the proposed mechanism can improve the workload balancing for the MapReduce framework. In addition, the results show that the performance is notably better than the native Hadoop system. The experimental results of testing workload balancing show that the improved workload is much more balanced than the native Hadoop system even if the dataset is very much skew. The experimental results of execution time show that the improved performance is 1.5 to 2 times faster the native Hadoop system in average.

MapReduce是一個高效率的平行運算平台。然而,MapReduce上的應用程式執行效率好壞易受到節點工作量的分配是否平均所影響。不平整的工作分配形成節點工作超載,進而造成不佳的執行效能,成為整個系統的效能瓶頸。有兩個主要的原因會造成工作分配不平整:其一為不良或過於簡單的分配機制造成之工作分配不平整,此問題稱為「不良分配(bad partition)」問題;其二為「鍵斜曲(key skew)」問題,「鍵斜曲」是由於有部份的鍵(key)具有非常大量的值(value),而在MapReduce上相同鍵會集中在一個節點處理,使得這些斜曲鍵(skew key)易造成工作負載不平衡。對於此兩類的負載不平衡問題,本研究分別提出虛擬處理機分群機制以及假鍵策略以解決這些問題。虛擬處理機分群機制藉由分配鍵值對(key-value pair)到大量的虛擬機,對工作負載進行重分配以達到工作負載平衡。針對「鍵斜曲」問題,本研究提出假鍵策略。假鍵策略將斜曲鍵打散成多個假鍵(pseudo key),進而將這些假鍵分散到多個工作節點處理以達到工作平均分配,並且在主要MapReduce工作(main pass)後加入一個完整的MapReduce工作(Merge pass)用以整合各節點以執行假鍵產生的部分結果(partial result)。最後,本研究實作虛擬處理機分群機制及假鍵策略,並且進行效能測試。在實驗結果中,本研究提出的機制以及策略明顯增進了MapReduce程式執行的效能以及降低了工作負載不平衡的現象。在工作負載平衡方面,本分群機制大幅提升負載平衡,甚至在高度斜曲的資料下亦有相當平整的分配結果。在工作執行效率方面,本分群機制工作執行效能相較於原來執行效能提升1.5倍至2倍。
URI: http://hdl.handle.net/11455/90788
其他識別: U0005-0408201516212200
Rights: 同意授權瀏覽/列印電子全文服務,2017-08-05起公開。
Appears in Collections:資訊科學與工程學系所

Files in This Item:
File Description SizeFormat Existing users please Login
nchu-104-7102056005-1.pdf489.14 kBAdobe PDFThis file is only available in the university internal network    Request a copy
Show full item record
 

Google ScholarTM

Check


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.