Huazhong University of Science and Technology, 2011

Performance-Aware Scheduling for Data-Intensive Cloud Computing

Shadi Ibrahim  
Abstract: Data volumes are growing continuously, from traditional applications such as databases and scientific computing to emerging applications such as Web 2.0 and online social networks. This growth has driven intensive research on scalable data-intensive systems, including MapReduce and Dryad. Among these systems, Hadoop, an open-source MapReduce implementation, is widely adopted by companies such as Facebook and Google, as well as by academia. Recently, MapReduce has been deployed in the cloud as software-as-a-service. Because of this wide adoption, the performance of Hadoop in particular (and MapReduce in general) has received much attention in systems research. Meanwhile, virtual machines (VMs) have become increasingly important for supporting efficient and flexible resource provisioning. With virtualization, cloud computing lets users perform elastic computation on large pools of VMs without the burden of owning or maintaining physical infrastructure. Consequently, when building large-scale data-intensive systems in the cloud (data-intensive cloud computing), developers need to understand the principles of designing large systems in order to achieve performance guarantees, load balancing, and fair charging for resource use. Performance in data-intensive cloud computing depends on many factors, including data locality, application type, and the underlying cloud infrastructure, which is mainly VM-based.
First, a novel replica-aware map execution scheme named Maestro is presented to reduce non-local map execution in MapReduce systems. In Maestro, map tasks are scheduled in two phases. The first phase, first-wave scheduling, schedules map tasks when the job initializes so as to fill all empty slots. The second phase, the run-time scheduler, schedules map tasks according to data locality, node availability, and block weight (the probability of the best replica for scheduling the task).
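The abstract does not give Maestro's formulas, so the following is only a minimal sketch of replica-aware, locality-first map scheduling, not the dissertation's algorithm. The function names, the data layout, and the tie-breaking rule (prefer the local task with the fewest replicas on other nodes that still have free slots) are assumptions made for illustration; that rule is a hypothetical stand-in for a block weight.

```python
from typing import Dict, List, Optional, Set, Tuple


def pick_map_task(node: str,
                  pending: Dict[str, Set[str]],
                  free_nodes: Set[str]) -> Optional[str]:
    """Choose a pending map task for `node`.

    pending maps a task id to the set of nodes holding a replica of its block.
    free_nodes is the set of nodes that currently have an empty map slot.
    """
    if not pending:
        return None
    local = [t for t, replicas in pending.items() if node in replicas]
    if local:
        # Hypothetical stand-in for a block weight: prefer the local task that
        # has the fewest replicas on *other* free nodes, because it is the one
        # least likely to obtain a local slot elsewhere.
        others = free_nodes - {node}
        return min(local, key=lambda t: (len(pending[t] & others), t))
    # No local task is available: fall back to a non-local assignment.
    return min(pending)


def schedule(blocks: Dict[str, Set[str]],
             free_slots: Dict[str, int]) -> List[Tuple[str, str, bool]]:
    """Fill empty slots and return (node, task, is_local) assignments."""
    pending = {task: set(replicas) for task, replicas in blocks.items()}
    slots = dict(free_slots)
    assignments: List[Tuple[str, str, bool]] = []
    while pending and any(slots.values()):
        for node in sorted(slots):
            if slots[node] == 0 or not pending:
                continue
            free_nodes = {n for n, s in slots.items() if s > 0}
            task = pick_map_task(node, pending, free_nodes)
            assignments.append((node, task, node in pending[task]))
            del pending[task]
            slots[node] -= 1
    return assignments


# Hypothetical cluster: three blocks, three nodes with one free slot each.
blocks = {"t1": {"A", "B"}, "t2": {"A"}, "t3": {"B", "C"}}
result = schedule(blocks, {"A": 1, "B": 1, "C": 1})
```

In this toy input, node A takes `t2` (whose only replica is on A) rather than `t1`, which leaves `t1` for node B and `t3` for node C, so every task runs on a node that holds its data. A first-come, first-served assignment would have given `t1` to A and forced `t2` to run non-locally.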
Interestingly, Maestro not only achieves higher locality in MapReduce-like systems, but also reduces unnecessary map task speculation and balances the distribution of intermediate data before the shuffle phase.
Existing MapReduce systems overlook the data skew that arises when intermediate keys vary significantly both in frequency and in their distribution across data nodes, a problem referred to as partitioning skew. Experimental results with Hadoop demonstrate that, in the presence of partitioning skew, applications suffer performance degradation caused by long data transfers during the shuffle phase and by computation skew, particularly in the reduce phase. To address this problem, a novel algorithm for locality-aware and fairness-aware key partitioning in MapReduce, referred to as LEEN, is developed. LEEN adopts an asynchronous map and reduce scheme. All buffered intermediate keys are partitioned according to their frequencies and the fairness of the expected data distribution after the shuffle phase. LEEN not only achieves higher locality and reduces the amount of shuffled data, but also guarantees a fair distribution of the reduce inputs.
In the cloud, the computing unit is the virtual machine (VM); it is therefore important to demonstrate the applicability of data-intensive computing in a virtualized data center. Although virtualization brings many benefits, such as improved resource utilization and isolation, VM interference makes performance predictability and system throughput challenging in large-scale virtualized environments. To this end, a quantitative analysis of the impact of interference on system fairness is presented. Because the cloud is an economics-based distributed system, the concept of pricing fairness is adopted from microeconomics. The analysis shows that the current pay-as-you-go scheme is neither personally nor socially fair.
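A small arithmetic illustration of why billing by wall-clock time can be unfair under interference follows. The price, the runtime, and the slowdown factor are hypothetical numbers chosen for the example; they are not measurements from the dissertation.

```python
def pay_as_you_go_bill(runtime_hours: float, cents_per_hour: float) -> float:
    """Bill by wall-clock VM time, regardless of why the job took that long."""
    return runtime_hours * cents_per_hour


# Hypothetical values: the same job on the same VM type, run twice.
isolated_hours = 2.0         # runtime when the VM has the host to itself
interference_slowdown = 1.5  # assumed slowdown caused by co-located VMs
cents_per_hour = 10.0

bill_isolated = pay_as_you_go_bill(isolated_hours, cents_per_hour)
bill_interfered = pay_as_you_go_bill(isolated_hours * interference_slowdown,
                                     cents_per_hour)

# The extra charge pays for time lost to other tenants, not for useful work.
interference_surcharge = bill_interfered - bill_isolated
```

Both runs perform identical work, yet the second user pays 30 cents instead of 20 because neighbouring VMs slowed the job down.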
Accordingly, to address the unfairness caused by interference, a new pricing scheme, pay-as-you-consume, is proposed. Under this scheme, users are charged according to their effective resource consumption, excluding interference. The key idea behind it is a machine-learning-based prediction model of the relative cost of interference. Preliminary experimental results with Xen demonstrate the accuracy of the prediction model and the fairness of the pay-as-you-consume pricing scheme.
The introduction of virtualization in Hadoop clusters poses new challenges due to the architectural design of the hypervisor. A series of experiments is conducted to measure and analyze the performance of Hadoop on VMs in terms of Hadoop Distributed File System (HDFS) throughput, performance variation under different VM consolidation levels and configurations, and task speculation. Based on the results, this dissertation outlines several issues to consider when adapting MapReduce to run entirely on virtual machines, such as decoupling the storage system (HDFS) from the computation unit (VMs). A novel MapReduce framework that runs on virtual machines, called Cloudlet, is then proposed.
Virtualization interference results from intertwined factors, including the application type, the number of concurrent VMs, and the VM scheduling algorithms used within the host. Further studies reveal that the choice of disk I/O scheduler pair can significantly affect application performance. Furthermore, a typical Hadoop application consists of different interleaving stages, each with different I/O workloads and patterns. As a result, a given disk scheduler pair is sub-optimal not only across different MapReduce applications but also across the sub-phases of a single job.
Accordingly, a novel approach is proposed that adaptively tunes the disk scheduler pairs in both the hypervisor and the virtual machines during the execution of a single MapReduce job. Experimental results show that MapReduce performance can be significantly improved; specifically, adaptive tuning of disk scheduler pairs achieves a 25% performance improvement on a sort benchmark with Hadoop.
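The idea of switching scheduler pairs per job phase can be sketched as a lookup table plus a plan of switch points. This is an illustration under stated assumptions, not the dissertation's tuning method: the phase names and the pair chosen for each phase are hypothetical. The scheduler names are Linux I/O schedulers of that era, and the sysfs path is the standard Linux location for selecting one; the sketch only builds a plan and writes nothing.

```python
from typing import Dict, List, Tuple

# (hypervisor scheduler, guest VM scheduler)
Pair = Tuple[str, str]

# Hypothetical per-phase choices, for illustration only. They are not taken
# from the dissertation's measurements.
PHASE_PAIRS: Dict[str, Pair] = {
    "map": ("cfq", "noop"),
    "shuffle": ("deadline", "deadline"),
    "reduce": ("noop", "cfq"),
}


def sysfs_path(device: str) -> str:
    """Linux sysfs file that selects the I/O scheduler of a block device."""
    return f"/sys/block/{device}/queue/scheduler"


def plan_switches(phases: List[str],
                  pairs: Dict[str, Pair]) -> List[Tuple[int, Pair]]:
    """Return (step index, pair) entries only where the pair must change."""
    plan: List[Tuple[int, Pair]] = []
    current = None
    for step, phase in enumerate(phases):
        pair = pairs[phase]
        if pair != current:
            plan.append((step, pair))
            current = pair
    return plan


# Hypothetical observed sequence of phases for one job.
plan = plan_switches(["map", "map", "shuffle", "reduce"], PHASE_PAIRS)
```

Applying an entry of the plan would mean writing the first name to the scheduler file on the host and the second inside each guest. Recording only the change points avoids redundant switches while a phase continues.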
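Degree-granting institution: Huazhong University of Science and Technology
Degree level: Doctoral
Year degree conferred: 2011
Classification number: TP311.13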
