(January 14) Characterization of MapReduce Applications on Private and Public Cloud Platforms

Posted: 2014-01-13

Speaker: Fan Zhang

Topic: Characterization of MapReduce Applications on Private and Public Cloud Platforms

Time: January 14, 2014, 10:00 AM

Venue: Meeting Room 412, Tsinghua Building A

About the speaker:

Dr. Fan Zhang is currently a postdoctoral associate with the Kavli Institute for Astrophysics and Space Research at the Massachusetts Institute of Technology. He is also a sponsored scientist at Tsinghua University, Beijing, China. He received his Ph.D. from the Department of Control Science and Engineering, Tsinghua University, in January 2012. From 2011 to 2013 he was a research scientist at the Cloud Computing Laboratory, Carnegie Mellon University. An IEEE Senior Member, he has received an Honorarium Research Funding Award from the University of Chicago and Argonne National Laboratory (2013), a Meritorious Service Award from IEEE Transactions on Services Computing (2013), and two IBM Ph.D. Fellowship Awards (2010 and 2011). His research interests include big-data scientific computing applications, simulation-based optimization approaches, cloud computing, and novel programming models for streaming data applications on elastic cloud platforms.

Abstract:

The MapReduce programming model is a widely accepted solution to the rapid growth of big-data processing demands. MapReduce applications with very large volumes of input data can run on an elastic compute cloud composed of many distributed computing instances. A public cloud provider, such as Amazon EC2, offers a spectrum of cloud resources at varying costs. Cloud users typically rent these elastic resources as virtual machines (VMs) under a pay-as-you-go model to gain access to large-scale cloud resources. However, different applications scale differently depending on their type, their behavior, and how effectively they use the available resources.

In this work, we attempt to characterize how MapReduce performance is affected by increased compute resources for a variety of application types. These applications span data- and compute-intensive benchmarks. Our major findings are as follows: (1) the execution times of MapReduce applications follow a power-law distribution; (2) for map-intensive applications, the power-law scalability starts from a small cluster size; and (3) for reduce-intensive applications, the power-law scalability starts from a larger cluster size.
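As a rough illustration of what power-law scalability could look like in practice, the sketch below fits execution time against cluster size. The abstract does not state a formula, so the functional form T(n) = a * n^b, the function names, and the sample numbers are all assumptions made for illustration; the data is synthetic and is not a result from the talk.

```python
import math


def fit_power_law(cluster_sizes, exec_times):
    """Least-squares fit of T(n) = a * n**b in log-log space.

    The power-law form is an assumed model, not one given in the abstract.
    Returns the pair (a, b).
    """
    xs = [math.log(n) for n in cluster_sizes]
    ys = [math.log(t) for t in exec_times]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    # Ordinary least squares for the line log T = log a + b * log n.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def predict_time(a, b, n):
    """Extrapolate the fitted model to a cluster of n instances."""
    return a * n ** b


# Synthetic small-scale "test runs" (hypothetical numbers, in seconds).
sizes = [2, 4, 8, 16]
times = [1000.0 * n ** -0.8 for n in sizes]

a, b = fit_power_law(sizes, times)
predicted_32 = predict_time(a, b, 32)
```

Under this assumed model, an exponent b close to -1 would indicate near-linear speedup, while a value closer to 0 would indicate little gain from adding instances. The findings about map-intensive and reduce-intensive applications suggest that such a fit would only apply above some minimum cluster size, which this sketch does not try to detect.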

Our research has also developed an in-depth understanding of MapReduce application performance and analyzed the impact of scaling input datasets. While we might expect "embarrassingly parallel" MapReduce jobs to scale linearly with input dataset size, our results show that execution time sometimes increases nonlinearly. Our execution-time analysis distinguishes four typical application behaviors when scaling input datasets.

Our characterization work will aid users in choosing appropriate computing resources, both virtual and physical, from small-scale experimental test runs. These approaches will predict performance speedups or slowdowns for MapReduce applications when scaling the infrastructure or the input datasets.
