10 years ago · a754f94ed7
--- a/hadoop-mapreduce-project/CHANGES.txt
+++ b/hadoop-mapreduce-project/CHANGES.txt
@@ -34,6 +34,8 @@ Release 2.7.0 - UNRELEASED
 
     MAPREDUCE-6150. Update document of Rumen (Masatake Iwasaki via aw)
 
+    MAPREDUCE-6151. Update document of GridMix (Masatake Iwasaki via aw)
+
   OPTIMIZATIONS
 
     MAPREDUCE-6169. MergeQueue should release reference to the current item 
--- a/hadoop-project/src/site/site.xml
+++ b/hadoop-project/src/site/site.xml
@@ -106,6 +106,7 @@
       <item name="Hadoop Streaming" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopStreaming.html"/>
       <item name="Hadoop Archives" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopArchives.html"/>
       <item name="DistCp" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistCp.html"/>
+      <item name="GridMix" href="hadoop-gridmix/GridMix.html"/>
       <item name="Rumen" href="hadoop-rumen/Rumen.html"/>
     </menu>
 
--- a/hadoop-tools/hadoop-gridmix/src/site/markdown/GridMix.md.vm
+++ b/hadoop-tools/hadoop-gridmix/src/site/markdown/GridMix.md.vm
@@ -38,21 +38,14 @@ Overview
 
 GridMix is a benchmark for Hadoop clusters. It submits a mix of
 synthetic jobs, modeling a profile mined from production loads.
-
-There exist three versions of the GridMix tool. This document
-discusses the third (checked into `src/contrib` ), distinct
-from the two checked into the `src/benchmarks` sub-directory.
-While the first two versions of the tool included stripped-down versions
-of common jobs, both were principally saturation tools for stressing the
-framework at scale. In support of a broader range of deployments and
-finer-tuned job mixes, this version of the tool will attempt to model
+This version of the tool will attempt to model
 the resource profiles of production jobs to identify bottlenecks, guide
-development, and serve as a replacement for the existing GridMix
-benchmarks.
+development.
 
 To run GridMix, you need a MapReduce job trace describing the job mix
-for a given cluster. Such traces are typically generated by Rumen (see
-Rumen documentation). GridMix also requires input data from which the
+for a given cluster. Such traces are typically generated by
+[Rumen](../hadoop-rumen/Rumen.html).
+GridMix also requires input data from which the
 synthetic jobs will be reading bytes. The input data need not be in any
 particular format, as the synthetic jobs are currently binary readers.
 If you are running on a new cluster, an optional step generating input
@@ -62,10 +55,15 @@ on the same or another cluster, follow these steps:
 
 1.  Locate the job history files on the production cluster. This
     location is specified by the
-    `mapred.job.tracker.history.completed.location`
+    `mapreduce.jobhistory.done-dir` or 
+    `mapreduce.jobhistory.intermediate-done-dir`
     configuration property of the cluster.
-
-2.  Run Rumen to build a job trace in JSON format for all or select jobs.
+    ([MapReduce historyserver](../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredCommands.html#historyserver)
+    moves job history files from `mapreduce.jobhistory.done-dir`
+    to `mapreduce.jobhistory.intermediate-done-dir`.)
+    
+2.  Run [Rumen](../hadoop-rumen/Rumen.html)
+    to build a job trace in JSON format for all or select jobs.
 
 3.  Use GridMix with the job trace on the benchmark cluster.
 
@@ -79,13 +77,17 @@ Usage
 
 Basic command-line usage without configuration parameters:
 
-    org.apache.hadoop.mapred.gridmix.Gridmix [-generate <size>] [-users <users-list>] <iopath> <trace>
+```
+java org.apache.hadoop.mapred.gridmix.Gridmix [-generate <size>] [-users <users-list>] <iopath> <trace>
+```
 
 Basic command-line usage with configuration parameters:
 
-    org.apache.hadoop.mapred.gridmix.Gridmix \
-      -Dgridmix.client.submit.threads=10 -Dgridmix.output.directory=foo \
-      [-generate <size>] [-users <users-list>] <iopath> <trace>
+```
+java org.apache.hadoop.mapred.gridmix.Gridmix \
+  -Dgridmix.client.submit.threads=10 -Dgridmix.output.directory=foo \
+  [-generate <size>] [-users <users-list>] <iopath> <trace>
+```
 
 > Configuration parameters like
 > `-Dgridmix.client.submit.threads=10` and
@@ -102,6 +104,8 @@ The `-generate` option is used to generate input data and
 Distributed Cache files for the synthetic jobs. It accepts standard units
 of size suffixes, e.g. `100g` will generate
 100 * 2<sup>30</sup> bytes as input data.
+The minimum size of input data in compressed format (128MB by default)
+is defined by `gridmix.min.file.size`.
 `<iopath>/input` is the destination directory for
 generated input data and/or the directory from which input data will be
 read. HDFS-based Distributed Cache files are generated under the
@@ -121,16 +125,17 @@ uncompressed. Use "-" as the value of this parameter if you
 want to pass an  *uncompressed*  trace via the standard
 input-stream of GridMix.
 
-The class `org.apache.hadoop.mapred.gridmix.Gridmix` can
-be found in the JAR
-`contrib/gridmix/hadoop-gridmix-$VERSION.jar` inside your
-Hadoop installation, where `$VERSION` corresponds to the
-version of Hadoop installed. A simple way of ensuring that this class
-and all its dependencies are loaded correctly is to use the
-`hadoop` wrapper script in Hadoop:
+GridMix expects certain library *JARs* to be present in  the *CLASSPATH*.
+One simple way to run GridMix is to use `hadoop jar` command to run it.
+You also need to add the JAR of Rumen to classpath for both of client and tasks
+as example shown below.
 
-    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
-      [-generate <size>] [-users <users-list>] <iopath> <trace>
+```
+HADOOP_CLASSPATH=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-rumen-2.5.1.jar \
+  $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-gridmix-2.5.1.jar \
+    -libjars $HADOOP_HOME/share/hadoop/tools/lib/hadoop-rumen-2.5.1.jar \
+    [-generate <size>] [-users <users-list>] <iopath> <trace>
+```
 
 The supported configuration parameters are explained in the
 following sections.
@@ -262,14 +267,14 @@ recorded in the trace. It constructs jobs of two types:
     </td>
     <td>A synthetic job where each task does  *nothing*  but sleep
     for a certain duration as observed in the production trace. The
-    scalability of the Job Tracker is often limited by how many
+    scalability of the ResourceManager is often limited by how many
     heartbeats it can handle every second. (Heartbeats are periodic
-    messages sent from Task Trackers to update their status and grab new
-    tasks from the Job Tracker.) Since a benchmark cluster is typically
+    messages sent from NodeManagers to update their status and grab new
+    tasks from the ResourceManager.) Since a benchmark cluster is typically
     a fraction in size of a production cluster, the heartbeat traffic
     generated by the slave nodes is well below the level of the
-    production cluster. One possible solution is to run multiple Task
-    Trackers on each slave node. This leads to the obvious problem that
+    production cluster. One possible solution is to run multiple
+    NodeManagers on each slave node. This leads to the obvious problem that
     the I/O workload generated by the synthetic jobs would thrash the
     slave nodes. Hence the need for such a job.</td>
   </tr>
@@ -334,7 +339,7 @@ Job Submission Policies
 
 GridMix controls the rate of job submission. This control can be
 based on the trace information or can be based on statistics it gathers
-from the Job Tracker. Based on the submission policies users define,
+from the ResourceManager. Based on the submission policies users define,
 GridMix uses the respective algorithm to control the job submission.
 There are currently three types of policies:
 
@@ -407,9 +412,9 @@ The following configuration parameters affect the job submission policy:
     <td>
       <code>gridmix.throttle.jobs-to-tracker-ratio</code>
     </td>
-    <td>In STRESS mode, the minimum ratio of running jobs to Task
-    Trackers in a cluster for the cluster to be considered
-     *overloaded* . This is the threshold TJ referred to earlier.
+    <td>In STRESS mode, the minimum ratio of running jobs to
+    NodeManagers in a cluster for the cluster to be considered
+    *overloaded* . This is the threshold TJ referred to earlier.
     The default is 1.0.</td>
   </tr>
   <tr>
@@ -688,20 +693,16 @@ correctly emulate compression.
 Emulating High-Ram jobs
 -----------------------
 
-MapReduce allows users to define a job as a High-Ram job. Tasks from a 
-High-Ram job can occupy multiple slots on the task-trackers. 
-Task-tracker assigns fixed virtual memory for each slot. Tasks from 
-High-Ram jobs can occupy multiple slots and thus can use up more 
-virtual memory as compared to a default task.
-
-Emulating this behavior is important because of the following reasons
+MapReduce allows users to define a job as a High-Ram job. Tasks from a
+High-Ram job can occupy larger fraction of memory in task processes.
+Emulating this behavior is important because of the following reasons.
 
 * Impact on scheduler:  Scheduling of tasks from High-Ram jobs 
-  impacts the scheduling behavior as it might result into slot 
-  reservation and slot/resource utilization.
+  impacts the scheduling behavior as it might result into
+  resource reservation and utilization.
 
-* Impact on the node : Since High-Ram tasks occupy multiple slots,
-  trackers do some bookkeeping for allocating extra resources for 
+* Impact on the node : Since High-Ram tasks occupy larger memory,
+  NodeManagers do some bookkeeping for allocating extra resources for 
   these tasks. Thus this becomes a precursor for memory emulation
   where tasks with high memory requirements needs to be considered
   as a High-Ram task.
@@ -808,11 +809,11 @@ job traces and cannot be accurately reproduced in GridMix:
 Appendix
 --------
 
+There exist older versions of the GridMix tool.
 Issues tracking the original implementations of
-<a href="https://issues.apache.org/jira/browse/HADOOP-2369">GridMix1</a>,
-<a href="https://issues.apache.org/jira/browse/HADOOP-3770">GridMix2</a>,
-and <a href="https://issues.apache.org/jira/browse/MAPREDUCE-776">GridMix3</a>
+[GridMix1](https://issues.apache.org/jira/browse/HADOOP-2369),
+[GridMix2](https://issues.apache.org/jira/browse/HADOOP-3770),
+and [GridMix3](https://issues.apache.org/jira/browse/MAPREDUCE-776)
 can be found on the Apache Hadoop MapReduce JIRA. Other issues tracking
 the current development of GridMix can be found by searching
-<a href="https://issues.apache.org/jira/browse/MAPREDUCE/component/12313086">
-the Apache Hadoop MapReduce JIRA</a>
+[the Apache Hadoop MapReduce JIRA](https://issues.apache.org/jira/browse/MAPREDUCE/component/12313086).