Improving MapReduce Performance in Heterogeneous Environments
Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, Ion Stoica
Abstract

Hadoop is an open-source implementation of MapReduce enjoying wide adoption, and it is often used for short jobs where low response time is critical. Hadoop's performance is closely tied to its task scheduler, which implicitly assumes that cluster nodes are homogeneous and that tasks make progress linearly, and uses these assumptions to decide when to speculatively re-execute tasks that appear to be stragglers. In practice, the homogeneity assumptions do not always hold. An especially compelling setting where this occurs is a virtualized data center, such as Amazon's Elastic Compute Cloud (EC2). We show that Hadoop's scheduler can cause severe performance degradation in heterogeneous environments. We design a new scheduling algorithm, Longest Approximate Time to End (LATE), that is highly robust to heterogeneity. LATE can improve Hadoop response times by a factor of 2 in clusters of 200 virtual machines on EC2.

1 Introduction

MapReduce is attractive for ad-hoc parallel processing of arbitrary data: it breaks a computation into small tasks that run in parallel on multiple machines, and scales easily to very large clusters of inexpensive commodity computers. Its popular open-source implementation is Hadoop.

Without speculative re-execution, a job would be as slow as the misbehaving task. Stragglers can arise for many reasons, including faulty hardware and misconfiguration. Google has noted that speculative execution can improve job response times by 44% [1].
In this work, we address the problem of how to robustly perform speculative execution to maximize performance. Hadoop's scheduler starts speculative tasks based on a simple heuristic comparing each task's progress to the average progress. Although this heuristic works well in homogeneous environments where stragglers are obvious, we show that it can lead to severe performance degradation when its underlying assumptions are broken. We design an improved scheduling algorithm that reduces Hadoop's response time by a factor of 2.
An especially compelling environment where Hadoop's scheduler is inadequate is a virtualized data center. Virtualized “utility computing” environments, such as Amazon's Elastic Compute Cloud (EC2) [3], are becoming an important tool for organizations that must process large amounts of data, because large numbers of virtual machines can be rented by the hour at lower cost than operating a data center year-round (EC2's current cost is $0.10 per CPU hour). For example, the New York Times rented 100 virtual machines for a day to convert 11 million scanned articles to PDFs [7]. Utility computing environments provide an economic advantage (paying by the hour), but they come with the caveat of having to run on virtualized resources with uncontrollable variations in performance. We also expect heterogeneity in private data centers, which increasingly use virtualization to simplify management and consolidate servers. We observed that Hadoop's homogeneity assumptions break down in such environments.
Naïvely, one might expect speculative execution to be a simple matter of duplicating tasks that are sufficiently slow. In reality, it is a complex issue for several reasons. First, speculative tasks are not free: they compete for certain resources, such as the network, with other running tasks. Second, choosing the node to run a speculative task on is as important as choosing the task. Third, in a heterogeneous environment, it may be difficult to distinguish between nodes that are slightly slower than the mean and stragglers. Finally, stragglers should be identified as early as possible to reduce response times.
Starting from first principles, we design a simple algorithm for speculative execution that is robust to heterogeneity and highly effective in practice. We call our algorithm LATE, for Longest Approximate Time to End. LATE is based on three principles: prioritizing tasks to speculate, selecting fast nodes to run on, and capping speculative tasks to prevent thrashing. We show that LATE can improve the response time of MapReduce jobs by a factor of 2 in large clusters on EC2.
The goal of speculative execution is to minimize a job's response time. Response time is most important for short jobs where a user wants an answer quickly, such as queries on log data for debugging, monitoring, and business intelligence. Short jobs are a major use case for MapReduce. For example, the average MapReduce job at Google in September 2007 took 395 seconds [1]. Systems designed for SQL-like queries on top of MapReduce, such as Sawzall [9] and Pig [10], underline the importance of this workload.
USENIX Association
Response time might not be the only metric of interest, because speculative tasks imply wasted work. However, even in pure throughput systems, speculation may be beneficial to prevent the prolonged life of many concurrent jobs all suffering from straggler tasks. Such nearly complete jobs occupy resources on the master, and disk space for map outputs on the slaves, until they terminate. Nonetheless, in our work, we focus on improving response time for short jobs.
2.1 Speculative Execution in Hadoop

Hadoop computes a reduce task's progress score from three phases, each of which counts for 1/3 of the score:

• The copy phase, when the task fetches map outputs over the network.

• The sort phase, when the map outputs are sorted by key.

• The reduce phase, when a user-defined function is applied to the list of map outputs with each key.
In each phase, the score is the fraction of data processed. For example, a task halfway through the copy phase has a progress score of 1/2 · 1/3 = 1/6, while a task halfway through the reduce phase scores 1/3 + 1/3 + (1/2 · 1/3) = 5/6.

Assumption 6 is inherent in the MapReduce paradigm, so we do not address it in this paper. Tasks in MapReduce should be small; otherwise, a single large task will slow down the entire job. In a well-behaved MapReduce job, the separation of input into equal chunks and the division of the key space among reducers ensure roughly equal amounts of work. If this is not the case, then launching a few extra speculative tasks is not harmful as long as obvious stragglers are also detected.
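The scoring rule above is easy to express as a small helper. The function below is a hypothetical sketch of the arithmetic, not Hadoop's actual code:

```python
# Sketch of the reduce-task progress score described above: a reduce task
# has three phases (copy, sort, reduce), each weighted 1/3.
# Hypothetical helper, not Hadoop's implementation.
PHASES = ("copy", "sort", "reduce")

def reduce_progress_score(phase: str, fraction_done: float) -> float:
    """Completed phases contribute 1/3 each; the current phase
    contributes its fraction of data processed times 1/3."""
    completed = PHASES.index(phase)   # phases fully finished so far
    return completed / 3 + fraction_done / 3

print(reduce_progress_score("copy", 0.5))    # 1/2 * 1/3 = 1/6
print(reduce_progress_score("reduce", 0.5))  # 1/3 + 1/3 + 1/2 * 1/3 = 5/6
```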
Hadoop uses each category of tasks (maps and reduces) to define a threshold for speculative execution: when a task's progress score is less than the average for its category minus 0.2, it is considered a straggler.
Finally, when running multiple jobs, Hadoop uses a FIFO discipline, where the earliest submitted job is asked for a task to run, then the second, and so on. There is also a priority system for putting jobs into higher-priority queues.
3 How the Assumptions Break Down

3.1 Heterogeneity

The first two assumptions in Section 2.2 are about homogeneity: Hadoop assumes that any detectably slow node is faulty. However, nodes can be slow for other reasons. In a non-virtualized data center, there may be multiple generations of hardware. In a virtualized data center where multiple virtual machines run on each physical host, such as Amazon EC2, co-location of VMs may cause heterogeneity. Although virtualization isolates CPU and memory performance, VMs compete for disk and network bandwidth. In EC2, co-located VMs use a host's full bandwidth when there is no contention and share bandwidth fairly when there is contention [12]. Contention can come from other users' VMs, in which case it may be transient, or from a user's own VMs if they do similar work, as in Hadoop. In Section 5.1, we measure heterogeneity on EC2 in detail.
Heterogeneity seriously impacts Hadoop's scheduler. Because the scheduler uses a fixed threshold for selecting tasks to speculate, too many speculative tasks may be launched, taking away resources from useful tasks (assumption 3 is also untrue). Also, because the scheduler ranks candidates by locality, the wrong tasks may be chosen for speculation first. For example, if the average progress were 70% and there were a 2x slower task at 35% progress and a 10x slower task at 7% progress, the 2x slower task might be speculated before the 10x slower task if its input data were available on an idle node.
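The failure mode in this example can be reproduced in a few lines. The snippet below is a simplified model of the fixed 0.2 threshold and locality-first ranking; the task tuples and names are hypothetical:

```python
# Simplified model of Hadoop's fixed-threshold speculation rule from the
# example above. Task names and the locality flag are hypothetical.
avg = 0.70
tasks = [("2x-slower", 0.35, True),    # (name, progress, data local to idle node)
         ("10x-slower", 0.07, False)]

# Both tasks fall below avg - 0.2 = 0.50, so the threshold cannot tell
# them apart: both are equally eligible for speculation.
eligible = [t for t in tasks if t[1] < avg - 0.2]
assert len(eligible) == 2

# Hadoop ranks candidates by data locality rather than slowness, so the
# 2x-slower task with local data is speculated before the 10x straggler.
first = sorted(eligible, key=lambda t: not t[2])[0]
print(first[0])  # 2x-slower
```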
We note that EC2 also provides “large” and “extra large” VM sizes that have lower variance in I/O performance than the default “small” VMs, possibly because they fully own a disk. However, small VMs can achieve higher I/O performance per dollar because they use all available disk bandwidth when no other VMs on the host are using it. Larger VMs also still compete for network bandwidth. Therefore, we focus on optimizing Hadoop on “small” VMs to get the best performance per dollar.
We have designed a new speculative task scheduler by starting from first principles and adding the features needed in practice.

3.2 Other Assumptions
Assumption 3, that speculating tasks on idle nodes costs nothing, breaks down when resources are shared. For example, the network is a bottleneck shared resource in large MapReduce jobs. Also, speculative tasks may compete for disk I/O in I/O-bound jobs. Finally, when multiple jobs are submitted, needless speculation reduces throughput without improving response time, by occupying nodes that could be running the next job.
Assumption 4, that a task's progress score is approximately equal to its percent completion, can cause incorrect speculation of reducers. In a typical MapReduce job, the copy phase of reduce tasks is the slowest, because it involves all-pairs communication over the network. Tasks quickly complete the other two phases once they have all map outputs. However, the copy phase counts for only 1/3 of the progress score. Thus, soon after the first few reducers in a job finish the copy phase, their progress goes from 1/3 to 1, greatly increasing the average progress. As soon as about 30% of reducers finish, the average progress is roughly 0.3 · 1 + 0.7 · 1/3 ≈ 53%, and all reducers still in the copy phase will be 20% behind the average, so an arbitrary set of them will be speculatively executed. Task slots will fill up with these speculative tasks, and true stragglers may never be re-executed.
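The jump in average progress is simple arithmetic, which the following back-of-the-envelope sketch checks (not a simulation of Hadoop):

```python
# Back-of-the-envelope check of the average-progress jump. When ~30% of
# reducers have finished (score 1.0) and 70% are still copying
# (score <= 1/3), the average progress score becomes:
avg = 0.3 * 1.0 + 0.7 * (1 / 3)
print(round(avg, 2))  # 0.53 -- about 53%

# A reducer halfway through its copy phase scores 1/2 * 1/3 ~= 17%,
# more than 0.2 below the average, so Hadoop's fixed-threshold rule
# flags it -- along with every other copy-phase reducer.
halfway_copy = 0.5 * (1 / 3)
assert halfway_copy < avg - 0.2
```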
To select a node, LATE ranks nodes by the total work they have performed (the sum of the progress scores for all succeeded and in-progress tasks on the node). This heuristic leads to better performance than assigning speculative tasks to arbitrary idle nodes. In addition, LATE uses:
• A cap on the number of speculative tasks that can be running at once, which we denote SpeculativeCap.
• A SlowTaskThreshold that a task's progress rate is compared with to determine whether it is “slow enough” to be speculated upon. This prevents needless speculation when only fast tasks are running.
Like Hadoop’s scheduler, we also wait until a task has run for 1 minute before evaluating it for speculation.
In practice, we have found that a good choice for the three parameters of LATE is to set SpeculativeCap to 10% of available task slots and to set SlowNodeThreshold and SlowTaskThreshold to the 25th percentile of node progress and of task progress rates, respectively. We use these values in our evaluation. We performed a sensitivity analysis in Section 5.4 showing that a wide range of thresholds performs well.
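Putting the parameters together, the decision LATE makes when a node asks for work can be sketched as below. This is our reading of the description in the text, with hypothetical bookkeeping structures; it is not the actual Hadoop implementation:

```python
# Sketch of LATE's speculation decision for one node requesting work.
# Task and node bookkeeping structures here are hypothetical stand-ins.
from dataclasses import dataclass

@dataclass
class Task:
    progress: float  # progress score in [0, 1]
    runtime: float   # seconds since the task was launched

    @property
    def progress_rate(self) -> float:
        return self.progress / self.runtime

    @property
    def est_time_left(self) -> float:
        return (1.0 - self.progress) / self.progress_rate

def percentile(values, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    idx = max(0, int(round(len(ordered) * pct / 100.0)) - 1)
    return ordered[idx]

def late_speculate(tasks, node_progress, all_nodes_progress,
                   running_spec, total_slots,
                   spec_cap=0.10, slow_task_pct=25, slow_node_pct=25):
    # Guard 1: cap the number of concurrently running speculative tasks.
    if running_spec >= spec_cap * total_slots:
        return None
    # Guard 2: do not give speculative work to a slow node (one below
    # SlowNodeThreshold in total work performed).
    if node_progress < percentile(all_nodes_progress, slow_node_pct):
        return None
    # Guard 3: only consider tasks that have run at least 1 minute and
    # whose progress rate is below SlowTaskThreshold.
    rate_cut = percentile([t.progress_rate for t in tasks], slow_task_pct)
    slow = [t for t in tasks
            if t.runtime >= 60 and t.progress_rate < rate_cut]
    # Speculate the slow task with the longest estimated time to end.
    return max(slow, key=lambda t: t.est_time_left, default=None)
```

On a toy workload, feeding in one straggler among several healthy tasks returns the straggler, while a SpeculativeCap that is already reached makes the function return None.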
4.2 Estimating Finish Times
Figure 2: A scenario where LATE estimates task finish orders incorrectly.
As a concrete example, consider a task with two phases, each counting for 50% of the progress score, in which the task runs faster in the first phase than in the second phase. The task spends 10 seconds in the first phase and 50 seconds in the second phase, or 60 s in total. Now suppose that we launch two copies of the task, T1 and T2, one at time 0 and one at time 10, and that we check their progress rates at time 20. Figure 2 illustrates this scenario. At time 20, T1 will have finished its first phase and be one fifth of the way through its second phase, so its progress score will be 60%, and its progress rate will be 60%/20s = 3%/s. Meanwhile, T2 will have just finished its first phase, so its progress rate will be 50%/10s = 5%/s. The estimated time left for T1 will be (100% − 60%)/(3%/s) = 13.3s. The estimated time left for T2 will be (100% − 50%)/(5%/s) = 10s. Therefore our heuristic will say that T1 will take longer to run than T2, while in reality T2 finishes second.
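The numbers in this scenario can be verified directly (a hypothetical two-phase task model matching the figures above):

```python
# Two-phase task from the text: phase 1 is 50% of the score and takes
# 10 s; phase 2 is the other 50% and takes 50 s (60 s total).
def score(runtime):
    """Progress score of one copy of the task after `runtime` seconds."""
    if runtime <= 10:
        return 0.5 * runtime / 10
    return 0.5 + 0.5 * min(runtime - 10, 50) / 50

run_t1, run_t2 = 20, 10                   # T1 started at t=0, T2 at t=10
rate_t1 = score(run_t1) / run_t1          # 0.60 / 20 = 3%/s
rate_t2 = score(run_t2) / run_t2          # 0.50 / 10 = 5%/s
left_t1 = (1 - score(run_t1)) / rate_t1   # ~13.3 s estimated
left_t2 = (1 - score(run_t2)) / rate_t2   # 10.0 s estimated
assert left_t1 > left_t2                  # heuristic: T1 looks slower...

finish_t1, finish_t2 = 0 + 60, 10 + 60    # ...but T2 actually ends last
assert finish_t2 > finish_t1
```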
5 Evaluation

| Environment | Scale (VMs) | Purpose |
|---|---|---|
| EC2 production | 871 | Measuring heterogeneity |
| EC2 test cluster | 100-243 | Scheduling experiments |
| Local testbed | 15 | Local experiments |
| EC2 production | 40 | Sensitivity analysis |
| Load Level | VMs | Write Perf (MB/s) | Std Dev |
|---|---|---|---|
| 1 VM/host | 202 | 61.8 | 4.9 |
| 2 VMs/host | 264 | 56.5 | 10.0 |
| 3 VMs/host | 201 | 53.6 | 11.2 |
| 4 VMs/host | 140 | 46.4 | 11.9 |
| 5 VMs/host | 45 | 34.2 | 7.9 |
| 6 VMs/host | 12 | 25.4 | 2.5 |
| 7 VMs/host | 7 | 24.8 | 0.9 |
Our measurements may not apply to CPU- or memory-bound workloads. However, the results are relevant to users running Hadoop at large scales on EC2, because these users will likely have co-located VMs (as we did) and Hadoop is an I/O-intensive workload.
Amazon describes each small VM as having the equivalent of “a 1.0-1.2 GHz 2007 Opteron or Xeon processor,” and 160 GB of disk space on potentially shared hard drives.
We used a traceroute from each VM to an external URL to figure out which physical machine each VM was on: the first hop from a Xen virtual machine is always the dom0, or supervisor process, for that physical host. Our 871 VMs ranged from 202 that were alone on their physical host up to 7 VMs sharing one physical host. Table 2 shows average performance and standard deviation at each level of co-location.
To validate that the performance was tied to contention for disk resources due to multiple VMs writing on the same host, we also tried performing dd's in a smaller EC2 allocation where 200 VMs were assigned to 200 distinct physical hosts. In this environment, dd performance was between 51 and 72 MB/s for all but three VMs. These achieved 44, 36 and 17 MB/s respectively. We do not know the cause of these stragglers. The nodes with 44 and 36 MB/s could be explained by contention with other users' VMs given our previous measurements, but the node with 17 MB/s might be a truly faulty machine. From these results, we conclude that background load is an important factor in I/O performance on EC2, and can reduce I/O performance by a factor of 2.5. We also see that stragglers can occur “in the wild” on EC2.
We also measured I/O performance on “large” and “extra-large” EC2 VMs. These VMs have 2 and 4 virtual disks respectively, which appear to be independent, and they achieve 50-60 MB/s on each disk. However, a large VM costs 4x more than a small one, and an extra-large costs 8x more. Thus the I/O performance per dollar is on average less than that of small VMs.
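Using the prices stated earlier ($0.10/hour for a small VM, with large and extra-large costing 4x and 8x) and roughly 55 MB/s per virtual disk, a rough best-case MB/s-per-dollar estimate looks as follows. This is an illustrative calculation under those assumptions, not a benchmark:

```python
# Best-case sequential-write MB/s per dollar-hour, using numbers from
# the text: a small VM on an unloaded host writes ~61.8 MB/s at
# $0.10/hr; large and extra-large VMs have 2 and 4 virtual disks at
# ~55 MB/s each, at 4x and 8x the small VM's price.
small  = 61.8 / 0.10
large  = (2 * 55) / (4 * 0.10)
xlarge = (4 * 55) / (8 * 0.10)
print(round(small), round(large), round(xlarge))  # 618 275 275
assert small > large and small > xlarge
```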
5.1.2 Impact of Contention at the Application Level
| Load Level | Hosts | VMs |
|---|---|---|
| 1 VM/host | 40 | 40 |
| 2 VMs/host | 20 | 40 |
| 3 VMs/host | 15 | 45 |
| 4 VMs/host | 10 | 40 |
| 5 VMs/host | 8 | 40 |
| 6 VMs/host | 4 | 24 |
| 7 VMs/host | 2 | 14 |
| Total | 99 | 243 |
5.2 Scheduling Experiments
We evaluated LATE, Hadoop's native scheduler, and no speculation in a variety of experiments on EC2, on clusters of about 200 VMs. For each experiment in this section, we performed 5-7 runs. Due to the environment's variability, some of the results had high variance. To address this issue, we show the average, worst-case and best-case performance for LATE in our results. We also ran experiments on a smaller local cluster, where we had full control over the environment, for further validation.
As our workload, we used a Sort job on a data set of 128 MB per host, or 30 GB of total data. Each job had 486 map tasks and 437 reduce tasks (Hadoop leaves some reduce capacity free for speculative and failed tasks). We repeated the experiment 6 times.
Figure 3 shows the response time achieved by each scheduler. Our graphs throughout this section show performance normalized against that of Hadoop's native scheduler. We show the worst-case and best-case gain from LATE to give an idea of the range involved, because the variance is high. On average, in this first experiment, LATE finished jobs 27% faster than Hadoop's native scheduler and 31% faster than no speculation.
5.2.1 Scheduling in a Heterogeneous Cluster

For our first experiment, we created a heterogeneous cluster by assigning different numbers of VMs to physical hosts. We used 1 to 7 VMs per host, for a total of 243 VMs.

5.2.2 Scheduling with Stragglers

We ran some machines with background processes to simulate stragglers. The other machines were assigned between 1 and 8 VMs per host, with about 10 in each load level.
Figure 5: EC2 Grep running times with stragglers: Worst, best and average-case performance of LATE against Hadoop’s scheduler and no speculation.
5.2.3 Other Workloads
To validate our use of the Sort benchmark, we also ran two other workloads, Grep and WordCount, on a heterogeneous cluster with stragglers. These are example jobs that come with the Hadoop distribution. We used a 204-node cluster with 1 to 8 VMs per physical host. We simulated eight stragglers with background load as above.
Grep searches for a regular expression in a text file and creates a file with the matches. It then launches a second MapReduce job to sort the matches. We only measured performance of the search job, because the sort job was too short for speculative execution to activate (less than a minute). We applied Grep to 43 GB of text data (repeated copies of Shakespeare's plays), or about 200 MB per host. We searched for the regular expression “the”. Results from 5 runs are shown in Figure 5. On average, LATE finished jobs 36% faster than Hadoop's native scheduler and 57% faster than no speculation.
| Load Level | VMs | Write Perf (MB/s) | Std Dev |
|---|---|---|---|
| 1 VM/host | 5 | 52.1 | 13.7 |
| 2 VMs/host | 6 | 20.9 | 2.7 |
| 4 VMs/host | 4 | 10.1 | 1.1 |
| Load Level | Hosts | VMs |
|---|---|---|
| 1 VM/host | 5 | 5 |
| 2 VMs/host | 3 | 6 |
| 4 VMs/host | 1 | 4 |
| Total | 9 | 15 |
5.3 Local Testbed Experiments
In order to validate our results from EC2 in a more tightly controlled environment, we also ran experiments on a local cluster of 9 physical hosts running the Xen virtualization software [13].
Figure 8: Local Sort with stragglers: Worst, best and average-case times for LATE against Hadoop's scheduler and no speculation.
Each host had Opteron processors with 4 GB of memory and a single 250 GB SATA drive. On each physical machine, we ran between one and four VMs.
5.4 Sensitivity Analysis
The machines ran synthetic background load, where each machine was deterministically “slowed down” by a different amount; by slowing machines deterministically, the heterogeneity initially put in the test cluster was fixed across runs.

5.4.1 Sensitivity to SpeculativeCap

We started by varying SpeculativeCap, that is, the percentage of slots that can be used for speculative tasks at any given time.

We see that response time drops sharply until
SpeculativeCap = 20%, after which it stays low. Thus we postulate that any value of SpeculativeCap beyond some minimum threshold, needed to speculatively re-execute the severe stragglers, will be adequate, as LATE prioritizes the slowest stragglers first. Of course, a higher threshold value is undesirable because LATE wastes more time on excess speculation. However, we see that the amount of wasted time does not grow rapidly, so there is a wide operating range. It is also interesting to note that at a low threshold of 10%, we have more wasted time than at 20%: although fewer speculative tasks are launched, the job runs longer, so more time is wasted in tasks that eventually get killed.
Running with no speculation had a response time of 247s (std dev 22s) and wasted time of 35s/node (std dev 16s), both of which are worse than LATE with SpeculativeCap = 20%.

Figure 12: Performance versus SlowNodeThreshold.

Finally, we note that the best value for SpeculativeCap in these sensitivity experiments,
20%, was larger than the value we used in our evaluation on EC2, 10%. The 10% threshold probably performed poorly in the sensitivity experiment because 6 out of our 40 nodes, or about 15%, were slow (by 3x or 10x). Unfortunately, it was too late for us to re-run our EC2 test cluster experiments with other values of SpeculativeCap, because we no longer had access to the test cluster. Nonetheless, we believe that performance in those experiments could only have improved with a larger SpeculativeCap, because the sensitivity results presented here show that beyond some minimum threshold, response time stays low and wasted work does not increase greatly. It is also possible that there were few enough stragglers in the large-scale experiments that a 10% cap was already high enough.
5.4.2 Sensitivity to SlowTaskThreshold
SlowTaskThreshold is the percentile of progress rate below which a task must lie to be considered for speculation (e.g. the slowest 5%). The idea is to avoid wasted work by not speculating tasks that are progressing quickly when they are the only non-speculated tasks left. For our tests varying this threshold, we set SpeculativeCap to the best value from the previous experiment, 20%, and set SlowNodeThreshold to 25%, a well-performing value. We tested 6 values of SlowTaskThreshold, from 5% to 100%. Figure 11 shows the results. We see again that while small threshold values harmfully limit the number of tasks that can be speculated, a wide range of higher values performs well.
5.4.3 Sensitivity to SlowNodeThreshold

For completeness, Figure 12 shows the results of varying SlowNodeThreshold from 5% to 50% while fixing SpeculativeCap = 20% and SlowTaskThreshold = 25%. As noted, the threshold has no significant effect on performance. However, it is comforting to see that even the very high threshold of 50% did not decrease performance by unnecessarily limiting the set of nodes on which we can run speculative tasks. This further supports the argument that, as long as SlowNodeThreshold is higher than the fraction of nodes that are extremely slow or faulty, LATE performs well.
Identifying slow tasks eventually is not enough. What matters is identifying the tasks that will hurt response time the most, and doing so as early as possible. Identifying a task as a laggard when it has run for more than two standard deviations beyond the mean is not very helpful for reducing response time: by this time, the job could already have run 3x longer than it should have! For this reason, LATE is based on estimated time left, and can detect slow tasks early on. A few other elements, such as a cap on speculative tasks, ensure reasonable behavior. Through our experience with Hadoop, we have gained substantial insight into the implications of heterogeneity for distributed applications. We take away four lessons:
2. Use finishing times, not progress rates, to prioritize the tasks to speculate.

3. Nodes are not equal. Avoid assigning speculative tasks to slow nodes.
As data-intensive computing, spurred by decreased storage costs, becomes a well-known technique, more and more organizations are adopting it. Facebook, a relatively young web company, has built a 300-node data warehouse using Hadoop in the past two years [14]. Many other companies and research groups are also using Hadoop [5, 6]. EC2 is also growing in popularity. It powers a number of startup companies, and it has enabled established organizations to rent capacity by the hour for running large computations [17]. Utility computing is also attractive to researchers, because it enables them to run scalability experiments without having to own large numbers of machines, as we did in this paper. Services like EC2 also level the playing field between research institutions by reducing infrastructure costs. Finally, even without utility computing motivating our work, heterogeneity will be a problem in private data centers as multiple generations of hardware accumulate and virtualization starts being used for management and consolidation. These factors mean that dealing with stragglers in MapReduce-like workloads will be an increasingly important problem.
Although selecting speculative tasks initially seems like a simple problem, we have shown that it is surprisingly subtle. First, simple thresholds, such as Hadoop's 20% progress rule, can fail in spectacular ways (see Section 3.2) when there is more heterogeneity than expected. Other work on identifying slow tasks, such as [15], suggests using the mean and the variance of the progress rate to set a threshold, which seems like a more reasonable approach; however, it still requires waiting until the mean and standard deviation can be computed with confidence.
Speculative execution in MapReduce also shares some ideas with “speculative execution” in distributed file systems.
[1] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Communications of the ACM, 51 (1): 107-113, 2008.
[4] Yahoo! Launches World's Largest Hadoop Production Application. http://tinyurl.com/2hgzv7

[5] Applications powered by Hadoop: http://wiki.apache.org/hadoop/PoweredBy
[6] Presentations by S. Schlosser and J. Lin at the 2008 Hadoop Summit. tinyurl.com/4a6lza
[9] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the Data: Parallel Analysis with Sawzall. Scientific Programming, 13(4), 2005.

[10] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. ACM SIGMOD 2008, June 2008.
[11] … ACM Trans. Comput. Syst., 24(4): 361-392, November 2006.

[12] Amazon EC2 Instance Types. tinyurl.com/3zjlrd
[13] B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, I. Pratt, A. Warfield, P. Barham, and R. Neugebauer. Xen and the Art of Virtualization. ACM SOSP 2003.
[14] Personal communication with the Yahoo! Hadoop team and with Joydeep Sen Sarma from Facebook.
… assignment in a distributed system: Improving performance …
[21] S. Manoharan. Effect of task duplication on the assignment of dependency graphs. Parallel Comput., 27(3): 257-268, 2001.

[22] Y. Su, M. Attariyan, and J. Flinn. AutoBash: Improving configuration management with operating system causality analysis. ACM SOSP 2007.