@@ -38,21 +38,14 @@ Overview

 GridMix is a benchmark for Hadoop clusters. It submits a mix of
 synthetic jobs, modeling a profile mined from production loads.
-
-There exist three versions of the GridMix tool. This document
-discusses the third (checked into `src/contrib` ), distinct
-from the two checked into the `src/benchmarks` sub-directory.
-While the first two versions of the tool included stripped-down versions
-of common jobs, both were principally saturation tools for stressing the
-framework at scale. In support of a broader range of deployments and
-finer-tuned job mixes, this version of the tool will attempt to model
+This version of the tool will attempt to model
 the resource profiles of production jobs to identify bottlenecks, guide
-development, and serve as a replacement for the existing GridMix
-benchmarks.
+development.

 To run GridMix, you need a MapReduce job trace describing the job mix
-for a given cluster. Such traces are typically generated by Rumen (see
-Rumen documentation). GridMix also requires input data from which the
+for a given cluster. Such traces are typically generated by
+[Rumen](../hadoop-rumen/Rumen.html).
+GridMix also requires input data from which the
 synthetic jobs will be reading bytes. The input data need not be in any
 particular format, as the synthetic jobs are currently binary readers.
 If you are running on a new cluster, an optional step generating input
@@ -62,10 +55,15 @@ on the same or another cluster, follow these steps:

 1. Locate the job history files on the production cluster. This
 location is specified by the
- `mapred.job.tracker.history.completed.location`
+ `mapreduce.jobhistory.done-dir` or
+ `mapreduce.jobhistory.intermediate-done-dir`
 configuration property of the cluster.
-
-2. Run Rumen to build a job trace in JSON format for all or select jobs.
+ ([MapReduce historyserver](../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredCommands.html#historyserver)
+ moves job history files from `mapreduce.jobhistory.intermediate-done-dir`
+ to `mapreduce.jobhistory.done-dir`.)
+
+2. Run [Rumen](../hadoop-rumen/Rumen.html)
+ to build a job trace in JSON format for all or select jobs.

 3. Use GridMix with the job trace on the benchmark cluster.

@@ -79,13 +77,17 @@ Usage

 Basic command-line usage without configuration parameters:

- org.apache.hadoop.mapred.gridmix.Gridmix [-generate <size>] [-users <users-list>] <iopath> <trace>
+```
+java org.apache.hadoop.mapred.gridmix.Gridmix [-generate <size>] [-users <users-list>] <iopath> <trace>
+```

 Basic command-line usage with configuration parameters:

- org.apache.hadoop.mapred.gridmix.Gridmix \
- -Dgridmix.client.submit.threads=10 -Dgridmix.output.directory=foo \
- [-generate <size>] [-users <users-list>] <iopath> <trace>
+```
+java org.apache.hadoop.mapred.gridmix.Gridmix \
+ -Dgridmix.client.submit.threads=10 -Dgridmix.output.directory=foo \
+ [-generate <size>] [-users <users-list>] <iopath> <trace>
+```

 > Configuration parameters like
 > `-Dgridmix.client.submit.threads=10` and
@@ -102,6 +104,8 @@ The `-generate` option is used to generate input data and
 Distributed Cache files for the synthetic jobs. It accepts standard units
 of size suffixes, e.g. `100g` will generate
 100 * 2<sup>30</sup> bytes as input data.
+The minimum size of input data in compressed format (128MB by default)
+is defined by `gridmix.min.file.size`.
 `<iopath>/input` is the destination directory for
 generated input data and/or the directory from which input data will be
 read. HDFS-based Distributed Cache files are generated under the
@@ -121,16 +125,17 @@ uncompressed. Use "-" as the value of this parameter if you
 want to pass an *uncompressed* trace via the standard
 input-stream of GridMix.

-The class `org.apache.hadoop.mapred.gridmix.Gridmix` can
-be found in the JAR
-`contrib/gridmix/hadoop-gridmix-$VERSION.jar` inside your
-Hadoop installation, where `$VERSION` corresponds to the
-version of Hadoop installed. A simple way of ensuring that this class
-and all its dependencies are loaded correctly is to use the
-`hadoop` wrapper script in Hadoop:
+GridMix expects certain library *JARs* to be present in the *CLASSPATH*.
+One simple way to run GridMix is to use the `hadoop jar` command.
+You also need to add the Rumen JAR to the classpath of both the client and the tasks,
+as shown in the example below.

- hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
- [-generate <size>] [-users <users-list>] <iopath> <trace>
+```
+HADOOP_CLASSPATH=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-rumen-2.5.1.jar \
+ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-gridmix-2.5.1.jar \
+ -libjars $HADOOP_HOME/share/hadoop/tools/lib/hadoop-rumen-2.5.1.jar \
+ [-generate <size>] [-users <users-list>] <iopath> <trace>
+```

 The supported configuration parameters are explained in the
 following sections.
@@ -262,14 +267,14 @@ recorded in the trace. It constructs jobs of two types:
 </td>
 <td>A synthetic job where each task does *nothing* but sleep
 for a certain duration as observed in the production trace. The
- scalability of the Job Tracker is often limited by how many
+ scalability of the ResourceManager is often limited by how many
 heartbeats it can handle every second. (Heartbeats are periodic
- messages sent from Task Trackers to update their status and grab new
- tasks from the Job Tracker.) Since a benchmark cluster is typically
+ messages sent from NodeManagers to update their status and grab new
+ tasks from the ResourceManager.) Since a benchmark cluster is typically
 a fraction in size of a production cluster, the heartbeat traffic
 generated by the slave nodes is well below the level of the
- production cluster. One possible solution is to run multiple Task
- Trackers on each slave node. This leads to the obvious problem that
+ production cluster. One possible solution is to run multiple
+ NodeManagers on each slave node. This leads to the obvious problem that
 the I/O workload generated by the synthetic jobs would thrash the
 slave nodes. Hence the need for such a job.</td>
 </tr>
@@ -334,7 +339,7 @@ Job Submission Policies

 GridMix controls the rate of job submission. This control can be
 based on the trace information or can be based on statistics it gathers
-from the Job Tracker. Based on the submission policies users define,
+from the ResourceManager. Based on the submission policies users define,
 GridMix uses the respective algorithm to control the job submission.
 There are currently three types of policies:

@@ -407,9 +412,9 @@ The following configuration parameters affect the job submission policy:
 <td>
 <code>gridmix.throttle.jobs-to-tracker-ratio</code>
 </td>
- <td>In STRESS mode, the minimum ratio of running jobs to Task
- Trackers in a cluster for the cluster to be considered
- *overloaded* . This is the threshold TJ referred to earlier.
+ <td>In STRESS mode, the minimum ratio of running jobs to
+ NodeManagers in a cluster for the cluster to be considered
+ *overloaded* . This is the threshold TJ referred to earlier.
 The default is 1.0.</td>
 </tr>
 <tr>
@@ -688,20 +693,16 @@ correctly emulate compression.
 Emulating High-Ram jobs
 -----------------------

-MapReduce allows users to define a job as a High-Ram job. Tasks from a
-High-Ram job can occupy multiple slots on the task-trackers.
-Task-tracker assigns fixed virtual memory for each slot. Tasks from
-High-Ram jobs can occupy multiple slots and thus can use up more
-virtual memory as compared to a default task.
-
-Emulating this behavior is important because of the following reasons
+MapReduce allows users to define a job as a High-Ram job. Tasks from a
+High-Ram job can occupy a larger fraction of memory in task processes.
+Emulating this behavior is important for the following reasons.

 * Impact on scheduler: Scheduling of tasks from High-Ram jobs
- impacts the scheduling behavior as it might result into slot
- reservation and slot/resource utilization.
+ impacts the scheduling behavior as it might result in
+ resource reservation and utilization.

-* Impact on the node : Since High-Ram tasks occupy multiple slots,
- trackers do some bookkeeping for allocating extra resources for
+* Impact on the node: Since High-Ram tasks occupy more memory,
+ NodeManagers do some bookkeeping for allocating extra resources for
 these tasks. Thus this becomes a precursor for memory emulation
 where tasks with high memory requirements needs to be considered
 as a High-Ram task.
@@ -808,11 +809,11 @@ job traces and cannot be accurately reproduced in GridMix:
 Appendix
 --------

+There exist older versions of the GridMix tool.
 Issues tracking the original implementations of
-<a href="https://issues.apache.org/jira/browse/HADOOP-2369">GridMix1</a>,
-<a href="https://issues.apache.org/jira/browse/HADOOP-3770">GridMix2</a>,
-and <a href="https://issues.apache.org/jira/browse/MAPREDUCE-776">GridMix3</a>
+[GridMix1](https://issues.apache.org/jira/browse/HADOOP-2369),
+[GridMix2](https://issues.apache.org/jira/browse/HADOOP-3770),
+and [GridMix3](https://issues.apache.org/jira/browse/MAPREDUCE-776)
 can be found on the Apache Hadoop MapReduce JIRA. Other issues tracking
 the current development of GridMix can be found by searching
-<a href="https://issues.apache.org/jira/browse/MAPREDUCE/component/12313086">
-the Apache Hadoop MapReduce JIRA</a>
+[the Apache Hadoop MapReduce JIRA](https://issues.apache.org/jira/browse/MAPREDUCE/component/12313086).
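
Note on the `-generate` hunk: the context lines state that `100g` generates 100 * 2<sup>30</sup> bytes. The arithmetic can be sketched in shell as below. This is only an illustration under stated assumptions: `size_to_bytes` is a hypothetical helper that is not part of GridMix, and the set of suffixes other than `g` is assumed rather than taken from the patch.

```shell
# Hypothetical helper (not part of GridMix): convert a size such as "100g"
# into bytes, assuming binary multiples (k = 2^10, m = 2^20, g = 2^30, t = 2^40).
size_to_bytes() {
  size="$1"
  number="${size%[kKmMgGtT]}"   # strip one trailing suffix letter, if present
  suffix="${size#"$number"}"    # the stripped suffix (empty when none was given)
  case "$suffix" in
    k|K) factor=1024 ;;
    m|M) factor=1048576 ;;
    g|G) factor=1073741824 ;;
    t|T) factor=1099511627776 ;;
    *)   factor=1 ;;            # no suffix: the value is already in bytes
  esac
  echo $(( number * factor ))
}

size_to_bytes 100g   # prints 107374182400, i.e. 100 * 2^30
size_to_bytes 128m   # prints 134217728
```

The second call shows what the 128MB default mentioned for `gridmix.min.file.size` would come to if it is read as a binary multiple, which is an assumption here.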