
HADOOP-10954. Adding site documents of hadoop-tools (Masatake Iwasaki via aw)

Allen Wittenauer 10 years ago
parent
commit
27f9edceed

+ 3 - 0
hadoop-common-project/hadoop-common/CHANGES.txt

@@ -212,6 +212,9 @@ Release 2.6.0 - UNRELEASED
     HADOOP-8808. Update FsShell documentation to mention deprecation of some 
     of the commands, and mention alternatives (Akira AJISAKA via aw)
 
+    HADOOP-10954. Adding site documents of hadoop-tools (Masatake Iwasaki 
+    via aw)
+
   OPTIMIZATIONS
 
     HADOOP-10838. Byte array native checksumming. (James Thomas via todd)

+ 818 - 0
hadoop-tools/hadoop-gridmix/src/site/markdown/GridMix.md.vm

@@ -0,0 +1,818 @@
+<!---
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+Gridmix
+=======
+
+---
+
+- [Overview](#Overview)
+- [Usage](#Usage)
+- [General Configuration Parameters](#General_Configuration_Parameters)
+- [Job Types](#Job_Types)
+- [Job Submission Policies](#Job_Submission_Policies)
+- [Emulating Users and Queues](#Emulating_Users_and_Queues)
+- [Emulating Distributed Cache Load](#Emulating_Distributed_Cache_Load)
+- [Configuration of Simulated Jobs](#Configuration_of_Simulated_Jobs)
+- [Emulating Compression/Decompression](#Emulating_CompressionDecompression)
+- [Emulating High-Ram jobs](#Emulating_High-Ram_jobs)
+- [Emulating resource usages](#Emulating_resource_usages)
+- [Simplifying Assumptions](#Simplifying_Assumptions)
+- [Appendix](#Appendix)
+
+---
+
+Overview
+--------
+
+GridMix is a benchmark for Hadoop clusters. It submits a mix of
+synthetic jobs, modeling a profile mined from production loads.
+
+There are three versions of the GridMix tool. This document
+discusses the third (checked into `src/contrib`), distinct
+from the two checked into the `src/benchmarks` sub-directory.
+While the first two versions of the tool included stripped-down versions
+of common jobs, both were principally saturation tools for stressing the
+framework at scale. In support of a broader range of deployments and
+finer-tuned job mixes, this version of the tool will attempt to model
+the resource profiles of production jobs to identify bottlenecks, guide
+development, and serve as a replacement for the existing GridMix
+benchmarks.
+
+To run GridMix, you need a MapReduce job trace describing the job mix
+for a given cluster. Such traces are typically generated by Rumen (see
+Rumen documentation). GridMix also requires input data from which the
+synthetic jobs will be reading bytes. The input data need not be in any
+particular format, as the synthetic jobs are currently binary readers.
+If you are running on a new cluster, an optional step generating input
+data may precede the run.
+In order to emulate the load of production jobs from a given cluster
+on the same or another cluster, follow these steps:
+
+1.  Locate the job history files on the production cluster. This
+    location is specified by the
+    `mapred.job.tracker.history.completed.location`
+    configuration property of the cluster.
+
+2.  Run Rumen to build a job trace in JSON format for all or select jobs.
+
+3.  Use GridMix with the job trace on the benchmark cluster.
+
+Jobs submitted by GridMix have names of the form
+"`GRIDMIXnnnnnn`", where
+"`nnnnnn`" is a sequence number padded with leading zeroes.
+
+
+Usage
+-----
+
+Basic command-line usage without configuration parameters:
+
+    org.apache.hadoop.mapred.gridmix.Gridmix [-generate <size>] [-users <users-list>] <iopath> <trace>
+
+Basic command-line usage with configuration parameters:
+
+    org.apache.hadoop.mapred.gridmix.Gridmix \
+      -Dgridmix.client.submit.threads=10 -Dgridmix.output.directory=foo \
+      [-generate <size>] [-users <users-list>] <iopath> <trace>
+
+> Configuration parameters like
+> `-Dgridmix.client.submit.threads=10` and
+> `-Dgridmix.output.directory=foo` as given above should
+> be used  *before*  other GridMix parameters.
+
+The `<iopath>` parameter is the working directory for
+GridMix. Note that this can either be on the local file-system
+or on HDFS, but it is highly recommended that it be the same as that for
+the original job mix so that GridMix puts the same load on the local
+file-system and HDFS respectively.
+
+The `-generate` option is used to generate input data and
+Distributed Cache files for the synthetic jobs. It accepts standard units
+of size suffixes, e.g. `100g` will generate
+100 * 2<sup>30</sup> bytes as input data.
+`<iopath>/input` is the destination directory for
+generated input data and/or the directory from which input data will be
+read. HDFS-based Distributed Cache files are generated under the
+distributed cache directory `<iopath>/distributedCache`.
+If some of the needed Distributed Cache files already exist in the
+distributed cache directory, then only the remaining missing
+Distributed Cache files are generated when the `-generate` option
+is specified.
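+
+For example, a data-generation run might look like the following (the
+`<gridmix-jar>` placeholder is the GridMix JAR described below; the iopath
+and trace paths are illustrative):
+
+    # generates 100 * 2^30 bytes of input data under <iopath>/input and the
+    # needed Distributed Cache files under <iopath>/distributedCache
+    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
+      -generate 100g /user/gridmix /user/gridmix/trace.json.gz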
+
+The `-users` option is used to point to a users-list
+file (see <a href="#usersqueues">Emulating Users and Queues</a>).
+
+The `<trace>` parameter is a path to a job trace
+generated by Rumen. This trace can be compressed (it must be readable
+using one of the compression codecs supported by the cluster) or
+uncompressed. Use "-" as the value of this parameter if you
+want to pass an  *uncompressed*  trace via the standard
+input-stream of GridMix.
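+
+For instance, an uncompressed trace can be streamed to GridMix like this
+(the local trace filename is illustrative):
+
+    # decompress a locally stored trace on the fly and pass it via "-"
+    gunzip -c job-trace.json.gz | \
+      hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
+      <iopath> -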
+
+The class `org.apache.hadoop.mapred.gridmix.Gridmix` can
+be found in the JAR
+`contrib/gridmix/hadoop-gridmix-$VERSION.jar` inside your
+Hadoop installation, where `$VERSION` corresponds to the
+version of Hadoop installed. A simple way of ensuring that this class
+and all its dependencies are loaded correctly is to use the
+`hadoop` wrapper script in Hadoop:
+
+    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
+      [-generate <size>] [-users <users-list>] <iopath> <trace>
+
+The supported configuration parameters are explained in the
+following sections.
+
+
+General Configuration Parameters
+--------------------------------
+
+<table>
+  <tr>
+    <th>Parameter</th>
+    <th>Description</th>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.output.directory</code>
+    </td>
+    <td>The directory into which output will be written. If specified,
+    <code>iopath</code> will be relative to this parameter. The
+    submitting user must have read/write access to this directory. The
+    user should also be mindful of any quota issues that may arise
+    during a run. The default is "<code>gridmix</code>".</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.client.submit.threads</code>
+    </td>
+    <td>The number of threads submitting jobs to the cluster. This
+    also controls how many splits will be loaded into memory at a given
+    time, pending the submit time in the trace. Splits are pre-generated
+    to hit submission deadlines, so particularly dense traces may want
+    more submitting threads. However, storing splits in memory is
+    reasonably expensive, so you should raise this cautiously. The
+    default is 1 for the SERIAL job-submission policy (see
+    <a href="#policies">Job Submission Policies</a>) and one more than
+    the number of processors on the client machine for the other
+    policies.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.submit.multiplier</code>
+    </td>
+    <td>The multiplier to accelerate or decelerate the submission of
+    jobs. The time separating two jobs is multiplied by this factor.
+    The default value is 1.0. This is a crude mechanism to size
+    a job trace to a cluster.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.client.pending.queue.depth</code>
+    </td>
+    <td>The depth of the queue of job descriptions awaiting split
+    generation. The jobs read from the trace occupy a queue of this
+    depth before being processed by the submission threads. It is
+    unusual to configure this. The default is 5.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.gen.blocksize</code>
+    </td>
+    <td>The block-size of generated data. The default value is 256
+    MiB.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.gen.bytes.per.file</code>
+    </td>
+    <td>The maximum bytes written per file. The default value is 1
+    GiB.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.min.file.size</code>
+    </td>
+    <td>The minimum size of the input files. The default limit is 128
+    MiB. Tweak this parameter if you see an error-message like
+    "Found no satisfactory file" while testing GridMix with
+    a relatively-small input data-set.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.max.total.scan</code>
+    </td>
+    <td>The maximum size of the input files. The default limit is 100
+    TiB.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.task.jvm-options.enable</code>
+    </td>
+    <td>Enables Gridmix to configure the simulated task's max heap 
+        options using the values obtained from the original task (i.e via
+        trace).
+    </td>
+  </tr>
+</table>
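+
+As noted in the Usage section, these parameters are passed as `-D` options
+placed before the other GridMix arguments. For example (the values below are
+chosen purely for illustration):
+
+    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
+      -Dgridmix.output.directory=gridmix-run-1 \
+      -Dgridmix.client.submit.threads=8 \
+      -Dgridmix.submit.multiplier=0.5 \
+      [-generate <size>] [-users <users-list>] <iopath> <trace>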
+
+
+Job Types
+---------
+
+GridMix takes as input a job trace, essentially a stream of
+JSON-encoded job descriptions. For each job description, the submission
+client obtains the original job submission time and for each task in
+that job, the byte and record counts read and written. Given this data,
+it constructs a synthetic job with the same byte and record patterns as
+recorded in the trace. It constructs jobs of two types:
+
+<table>
+  <tr>
+    <th>Job Type</th>
+    <th>Description</th>
+  </tr>
+  <tr>
+    <td>
+      <code>LOADJOB</code>
+    </td>
+    <td>A synthetic job that emulates the workload recorded in the Rumen
+    trace. The current version emulates only I/O. It reproduces
+    the I/O workload on the benchmark cluster. It does so by embedding
+    the detailed I/O information for every map and reduce task, such as
+    the number of bytes and records read and written, into each
+    job's input splits. The map tasks further relay the I/O patterns of
+    reduce tasks through the intermediate map output data.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>SLEEPJOB</code>
+    </td>
+    <td>A synthetic job where each task does  *nothing*  but sleep
+    for a certain duration as observed in the production trace. The
+    scalability of the Job Tracker is often limited by how many
+    heartbeats it can handle every second. (Heartbeats are periodic
+    messages sent from Task Trackers to update their status and grab new
+    tasks from the Job Tracker.) Since a benchmark cluster is typically
+    a fraction of the size of a production cluster, the heartbeat traffic
+    generated by the slave nodes is well below the level of the
+    production cluster. One possible solution is to run multiple Task
+    Trackers on each slave node. This leads to the obvious problem that
+    the I/O workload generated by the synthetic jobs would thrash the
+    slave nodes. Hence the need for such a job.</td>
+  </tr>
+</table>
+
+The following configuration parameters affect the job type:
+
+<table>
+  <tr>
+    <th>Parameter</th>
+    <th>Description</th>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.job.type</code>
+    </td>
+    <td>The value for this key can be one of LOADJOB or SLEEPJOB. The
+    default value is LOADJOB.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.key.fraction</code>
+    </td>
+    <td>For a LOADJOB type of job, the fraction of a record used for
+    the data for the key. The default value is 0.1.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.sleep.maptask-only</code>
+    </td>
+    <td>For a SLEEPJOB type of job, whether to ignore the reduce
+    tasks for the job. The default is <code>false</code>.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.sleep.fake-locations</code>
+    </td>
+    <td>For a SLEEPJOB type of job, the number of fake locations
+    for map tasks for the job. The default is 0.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.sleep.max-map-time</code>
+    </td>
+    <td>For a SLEEPJOB type of job, the maximum runtime for map
+    tasks for the job in milliseconds. The default is unlimited.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.sleep.max-reduce-time</code>
+    </td>
+    <td>For a SLEEPJOB type of job, the maximum runtime for reduce
+    tasks for the job in milliseconds. The default is unlimited.</td>
+  </tr>
+</table>
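+
+For example, a sleep-job run with map-only tasks capped at one minute might
+be configured as follows (the values are illustrative only):
+
+    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
+      -Dgridmix.job.type=SLEEPJOB \
+      -Dgridmix.sleep.maptask-only=true \
+      -Dgridmix.sleep.max-map-time=60000 \
+      <iopath> <trace>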
+
+
+<a name="policies"></a>
+
+Job Submission Policies
+-----------------------
+
+GridMix controls the rate of job submission. This control can be
+based on the trace information or can be based on statistics it gathers
+from the Job Tracker. Based on the submission policies users define,
+GridMix uses the respective algorithm to control the job submission.
+There are currently three types of policies:
+
+<table>
+  <tr>
+    <th>Job Submission Policy</th>
+    <th>Description</th>
+  </tr>
+  <tr>
+    <td>
+      <code>STRESS</code>
+    </td>
+    <td>Keep submitting jobs so that the cluster remains under stress.
+    In this mode we control the rate of job submission by monitoring
+    the real-time load of the cluster so that we can maintain a stable
+    stress level of workload on the cluster. Based on the statistics we
+    gather we define if a cluster is  *underloaded*  or
+     *overloaded* . We consider a cluster  *underloaded*  if
+    and only if the following three conditions are true:
+    <ol>
+      <li>the number of pending and running jobs are under a threshold
+      TJ</li>
+      <li>the number of pending and running maps are under threshold
+      TM</li>
+      <li>the number of pending and running reduces are under threshold
+      TR</li>
+    </ol>
+    The thresholds TJ, TM and TR are proportional to the size of the
+    cluster and map, reduce slots capacities respectively. In case of a
+    cluster being  *overloaded* , we throttle the job submission.
+    In the actual calculation we also weigh each running task with its
+    remaining work - namely, a 90% complete task is only counted as 0.1
+    in calculation. Finally, to avoid a very large job blocking other
+    jobs, we limit the number of pending/waiting tasks each job can
+    contribute.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>REPLAY</code>
+    </td>
+    <td>In this mode we replay the job traces faithfully. This mode
+    exactly follows the time-intervals given in the actual job
+    trace.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>SERIAL</code>
+    </td>
+    <td>In this mode we submit the next job only once the job submitted
+    earlier is completed.</td>
+  </tr>
+</table>
+
+The following configuration parameters affect the job submission policy:
+
+<table>
+  <tr>
+    <th>Parameter</th>
+    <th>Description</th>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.job-submission.policy</code>
+    </td>
+    <td>The value for this key can be one of STRESS, REPLAY
+    or SERIAL. In most cases the value would be STRESS or
+    REPLAY. The default value is STRESS.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.throttle.jobs-to-tracker-ratio</code>
+    </td>
+    <td>In STRESS mode, the minimum ratio of running jobs to Task
+    Trackers in a cluster for the cluster to be considered
+     *overloaded* . This is the threshold TJ referred to earlier.
+    The default is 1.0.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.throttle.maps.task-to-slot-ratio</code>
+    </td>
+    <td>In STRESS mode, the minimum ratio of pending and running map
+    tasks (i.e. incomplete map tasks) to the number of map slots for
+    a cluster for the cluster to be considered  *overloaded* .
+    This is the threshold TM referred to earlier. Running map tasks are
+    counted partially. For example, a 40% complete map task is counted
+    as 0.6 map tasks. The default is 2.0.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.throttle.reduces.task-to-slot-ratio</code>
+    </td>
+    <td>In STRESS mode, the minimum ratio of pending and running reduce
+    tasks (i.e. incomplete reduce tasks) to the number of reduce slots
+    for a cluster for the cluster to be considered  *overloaded* .
+    This is the threshold TR referred to earlier. Running reduce tasks
+    are counted partially. For example, a 30% complete reduce task is
+    counted as 0.7 reduce tasks. The default is 2.5.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.throttle.maps.max-slot-share-per-job</code>
+    </td>
+    <td>In STRESS mode, the maximum share of a cluster's map-slots
+    capacity that can be counted toward a job's incomplete map tasks in
+    overload calculation. The default is 0.1.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.throttle.reducess.max-slot-share-per-job</code>
+    </td>
+    <td>In STRESS mode, the maximum share of a cluster's reduce-slots
+    capacity that can be counted toward a job's incomplete reduce tasks
+    in overload calculation. The default is 0.1.</td>
+  </tr>
+</table>
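+
+For instance, a REPLAY run, and a STRESS run with a tighter map-slot
+threshold, might look as follows (the ratio value is illustrative only):
+
+    # replay the trace with its original inter-job time intervals
+    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
+      -Dgridmix.job-submission.policy=REPLAY <iopath> <trace>
+
+    # STRESS run that considers the cluster overloaded sooner
+    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
+      -Dgridmix.job-submission.policy=STRESS \
+      -Dgridmix.throttle.maps.task-to-slot-ratio=1.5 <iopath> <trace>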
+
+
+<a name="usersqueues"></a>
+
+Emulating Users and Queues
+--------------------------
+
+Typical production clusters are often shared with different users and
+the cluster capacity is divided among different departments through job
+queues. Ensuring fairness among jobs from all users, honoring queue
+capacity allocation policies and avoiding an ill-behaving job from
+taking over the cluster adds significant complexity in Hadoop software.
+To be able to sufficiently test and discover bugs in these areas,
+GridMix must emulate the contentions of jobs from different users and/or
+submitted to different queues.
+
+Emulating multiple queues is easy - we simply set up the benchmark
+cluster with the same queue configuration as the production cluster and
+we configure synthetic jobs so that they get submitted to the same queue
+as recorded in the trace. However, not all users shown in the trace have
+accounts on the benchmark cluster. Instead, we set up a number of testing
+user accounts and associate each unique user in the trace to testing
+users in a round-robin fashion.
+
+The following configuration parameters affect the emulation of users
+and queues:
+
+<table>
+  <tr>
+    <th>Parameter</th>
+    <th>Description</th>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.job-submission.use-queue-in-trace</code>
+    </td>
+    <td>When set to <code>true</code> it uses exactly the same set of
+    queues as those mentioned in the trace. The default value is
+    <code>false</code>.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.job-submission.default-queue</code>
+    </td>
+    <td>Specifies the default queue to which all the jobs would be
+    submitted. If this parameter is not specified, GridMix uses the
+    default queue defined for the submitting user on the cluster.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.user.resolve.class</code>
+    </td>
+    <td>Specifies which <code>UserResolver</code> implementation to use.
+    We currently have three implementations:
+    <ol>
+      <li><code>org.apache.hadoop.mapred.gridmix.EchoUserResolver</code>
+      - submits a job as the user who submitted the original job. All
+      the users of the production cluster identified in the job trace
+      must also have accounts on the benchmark cluster in this case.</li>
+      <li><code>org.apache.hadoop.mapred.gridmix.SubmitterUserResolver</code>
+      - submits all the jobs as the current GridMix user. In this case we
+      simply map all the users in the trace to the current GridMix user
+      and submit the job.</li>
+      <li><code>org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver</code>
+      - maps trace users to test users in a round-robin fashion. In
+      this case we set up a number of testing user accounts and
+      associate each unique user in the trace to testing users in a
+      round-robin fashion.</li>
+    </ol>
+    The default is
+    <code>org.apache.hadoop.mapred.gridmix.SubmitterUserResolver</code>.</td>
+  </tr>
+</table>
+
+If the parameter `gridmix.user.resolve.class` is set to
+`org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver`,
+we need to define a users-list file with a list of test users.
+This is specified using the `-users` option to GridMix.
+
+> Specifying a users-list file using the `-users` option is
+> mandatory when using the round-robin user-resolver. Other user-resolvers
+> ignore this option.
+
+A users-list file has one user per line, each line of the format:
+
+    <username>
+
+For example:
+
+    user1
+    user2
+    user3
+
+In the above example we have defined three users `user1`, `user2` and `user3`.
+Each unique user in the trace is then associated with the above users
+in round-robin fashion. For example, if the trace's users are
+`tuser1`, `tuser2`, `tuser3`, `tuser4` and `tuser5`, then the mappings would be:
+
+    tuser1 -> user1
+    tuser2 -> user2
+    tuser3 -> user3
+    tuser4 -> user1
+    tuser5 -> user2
+
+For backward compatibility reasons, each line of the users-list file can
+contain a username followed by group names, in the form `username[,group]*`.
+The group names are ignored by GridMix.
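+
+A minimal sketch of a run using the round-robin resolver (the users-list
+path is illustrative):
+
+    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
+      -Dgridmix.user.resolve.class=org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver \
+      -users file:///home/user/users-list.txt <iopath> <trace>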
+
+
+Emulating Distributed Cache Load
+--------------------------------
+
+Gridmix emulates Distributed Cache load by default for LOADJOB type of
+jobs. This is done by precreating the needed Distributed Cache files for all
+the simulated jobs as part of a separate MapReduce job.
+
+Emulation of Distributed Cache load in GridMix simulated jobs can be
+disabled by setting the property
+`gridmix.distributed-cache-emulation.enable` to
+`false`.
+Generation of Distributed Cache data by GridMix, however, is driven by the
+`-generate` option and is independent of this configuration
+property.
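+
+For example, a run that keeps input-data generation but turns off
+Distributed Cache load emulation could look like this:
+
+    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
+      -Dgridmix.distributed-cache-emulation.enable=false \
+      -generate 100g <iopath> <trace>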
+
+Both generation of Distributed Cache files and emulation of
+Distributed Cache load are disabled if:
+
+* the input trace comes from the standard input-stream instead of a file, or
+* the specified `<iopath>` is on the local file-system, or
+* any of the ancestor directories of the distributed cache directory
+  `<iopath>/distributedCache` (including the distributed
+  cache directory itself) doesn't have execute permission for others.
+
+
+Configuration of Simulated Jobs
+-------------------------------
+
+Gridmix3 sets some configuration properties in the simulated Jobs
+submitted by it so that they can be mapped back to the corresponding Job
+in the input Job trace. These configuration parameters include:
+
+<table>
+  <tr>
+    <th>Parameter</th>
+    <th>Description</th>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.job.original-job-id</code>
+    </td>
+    <td> The job id of the original cluster's job corresponding to this
+    simulated job.
+    </td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.job.original-job-name</code>
+    </td>
+    <td> The job name of the original cluster's job corresponding to this
+    simulated job.
+    </td>
+  </tr>
+</table>
+
+
+Emulating Compression/Decompression
+-----------------------------------
+
+MapReduce supports data compression and decompression.
+Input to a MapReduce job can be compressed. Similarly, output of Map
+and Reduce tasks can also be compressed. Compression/Decompression
+emulation in GridMix is important because emulating
+compression/decompression affects the CPU and memory usage of the
+task. A task emulating compression/decompression will affect other
+tasks and daemons running on the same node.
+
+Compression emulation is enabled if 
+`gridmix.compression-emulation.enable` is set to
+`true`. By default compression emulation is enabled for 
+jobs of type  *LOADJOB* . With compression emulation enabled,
+GridMix generates compressible text data with a constant
+compression ratio. Hence a simulated GridMix job emulates
+compression/decompression using compressible text data (having a
+constant compression ratio), irrespective of the compression ratio
+observed in the actual job.
+
+A typical MapReduce Job deals with data compression/decompression in
+the following phases:
+
+* `Job input data decompression: ` GridMix generates 
+  compressible input data when compression emulation is enabled. 
+  Based on the original job's configuration, a simulated GridMix job 
+  will use a decompressor to read the compressed input data. 
+  Currently, GridMix uses
+  `mapreduce.input.fileinputformat.inputdir` to determine 
+  if the original job used compressed input data or 
+  not. If the original job's input files are uncompressed then the 
+  simulated job will read the compressed input file without using a 
+  decompressor. 
+
+* `Intermediate data compression and decompression: `
+  If the original job has map output compression enabled then GridMix 
+  too will enable map output compression for the simulated job. 
+  Accordingly, the reducers will use a decompressor to read the map 
+  output data.
+
+* `Job output data compression: `
+  If the original job's output is compressed then GridMix 
+  too will enable job output compression for the simulated job. 
+       
+The following configuration parameters affect compression emulation:
+
+<table>
+  <tr>
+    <th>Parameter</th>
+    <th>Description</th>
+  </tr>
+  <tr>
+    <td><code>gridmix.compression-emulation.enable</code></td>
+    <td>Enables compression emulation in simulated GridMix jobs. 
+        Default is true.</td>
+  </tr>
+</table>
+      
+With compression emulation turned on, GridMix will generate compressed
+input data. Hence the total size of the input
+data will be less than the expected size. Set
+`gridmix.min.file.size` to a smaller value (roughly 10% of
+`gridmix.gen.bytes.per.file`) so that GridMix can
+correctly emulate compression.
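+
+A sketch of a run tuned for compression emulation, with
+`gridmix.min.file.size` lowered to roughly 10% of the default
+`gridmix.gen.bytes.per.file` (the exact value below is illustrative):
+
+    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
+      -Dgridmix.compression-emulation.enable=true \
+      -Dgridmix.min.file.size=104857600 \
+      -generate 100g <iopath> <trace>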
+
+
+Emulating High-Ram jobs
+-----------------------
+
+MapReduce allows users to define a job as a High-Ram job. Tasks from a
+High-Ram job can occupy multiple slots on the task-trackers.
+The task-tracker assigns a fixed amount of virtual memory to each slot.
+Since tasks from High-Ram jobs occupy multiple slots, they can use more
+virtual memory than a default task.
+
+Emulating this behavior is important for the following reasons:
+
+* Impact on the scheduler: Scheduling tasks from High-Ram jobs
+  affects scheduling behavior, as it might result in slot
+  reservation and changed slot/resource utilization.
+
+* Impact on the node: Since High-Ram tasks occupy multiple slots,
+  trackers do some bookkeeping to allocate extra resources for
+  these tasks. This is also a precursor for memory emulation,
+  where tasks with high memory requirements need to be treated
+  as High-Ram tasks.
+
+High-Ram feature emulation can be disabled by setting  
+`gridmix.highram-emulation.enable` to `false`.
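+
+For example:
+
+    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
+      -Dgridmix.highram-emulation.enable=false <iopath> <trace>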
+    
+
+Emulating resource usages
+-------------------------
+
+Usage of resources like CPU, physical memory, virtual memory, JVM heap,
+etc. is recorded by MapReduce using its task counters. This information
+is used by GridMix to emulate the resource usage in the simulated
+tasks. Emulating resource usage helps GridMix exert a load on the test
+cluster similar to that seen on the actual cluster.
+
+MapReduce tasks use resources during their entire lifetime. GridMix
+tries to mimic this behavior by spanning resource usage emulation
+across the entire lifetime of the simulated task. Each resource to be
+emulated should have an  *emulator*  associated with it.
+Each such  *emulator*  should implement the
+`org.apache.hadoop.mapred.gridmix.emulators.resourceusage.ResourceUsageEmulatorPlugin`
+interface. Resource
+ *emulators*  in GridMix are  *plugins*  that can be
+configured (plugged in or out) before every run. GridMix users can
+configure multiple emulator  *plugins*  by passing a comma-separated
+list of  *emulators*  as the value of the
+`gridmix.emulators.resource-usage.plugins` parameter (see the
+configuration sketch after the list below).
+
+List of  *emulators*  shipped with GridMix:
+
+* Cumulative CPU usage  *emulator* :
+  GridMix uses the cumulative CPU usage value published by Rumen
+  and makes sure that the total cumulative CPU usage of the simulated
+  task is close to the value published by Rumen. GridMix can be
+  configured to emulate cumulative CPU usage by adding
+  `org.apache.hadoop.mapred.gridmix.emulators.resourceusage.CumulativeCpuUsageEmulatorPlugin`
+  to the list of emulator  *plugins*  configured for the
+  `gridmix.emulators.resource-usage.plugins` parameter.
+  The CPU usage emulator is designed so that
+  it only emulates at specific progress boundaries of the task. This
+  interval can be configured using
+  `gridmix.emulators.resource-usage.cpu.emulation-interval`.
+  The default value for this parameter is `0.1`, i.e.
+  `10%`.
+
+* Total heap usage  *emulator* :
+  GridMix uses the total heap usage value published by Rumen
+  and makes sure that the total heap usage of the simulated
+  task is close to the value published by Rumen. GridMix can be
+  configured to emulate total heap usage by adding
+  `org.apache.hadoop.mapred.gridmix.emulators.resourceusage.TotalHeapUsageEmulatorPlugin`
+  to the list of emulator  *plugins*  configured for the
+  `gridmix.emulators.resource-usage.plugins` parameter.
+  The heap usage emulator is designed so that
+  it only emulates at specific progress boundaries of the task. This
+  interval can be configured using
+  `gridmix.emulators.resource-usage.heap.emulation-interval`.
+  The default value for this parameter is `0.1`, i.e.
+  a `10%` progress interval.
+
+Note that GridMix will emulate resource usages only for jobs of type *LOADJOB* .
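+
+A configuration sketch that enables both shipped plugins (a shell variable is
+used only to keep the command readable; the CPU emulation interval value is
+illustrative):
+
+    PLUGINS=org.apache.hadoop.mapred.gridmix.emulators.resourceusage.CumulativeCpuUsageEmulatorPlugin
+    PLUGINS=$PLUGINS,org.apache.hadoop.mapred.gridmix.emulators.resourceusage.TotalHeapUsageEmulatorPlugin
+    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
+      -Dgridmix.emulators.resource-usage.plugins=$PLUGINS \
+      -Dgridmix.emulators.resource-usage.cpu.emulation-interval=0.05 \
+      <iopath> <trace>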
+
+
+Simplifying Assumptions
+-----------------------
+
+GridMix will be developed in stages, incorporating feedback and
+patches from the community. Currently its intent is to evaluate
+MapReduce and HDFS performance and not the layers on top of them (i.e.
+the extensive lib and sub-project space). Given these two limitations,
+the following characteristics of job load are not currently captured in
+job traces and cannot be accurately reproduced in GridMix:
+
+* *Filesystem Properties*  - No attempt is made to match block
+  sizes, namespace hierarchies, or any property of input, intermediate
+  or output data other than the bytes/records consumed and emitted from
+  a given task. This implies that some of the most heavily-used parts of
+  the system - text processing, streaming, etc. - cannot be meaningfully tested 
+  with the current implementation.
+  
+* *I/O Rates*  - The rate at which records are
+  consumed/emitted is assumed to be limited only by the speed of the
+  reader/writer and constant throughout the task.
+  
+* *Memory Profile*  - No data on tasks' memory usage over time
+  is available, though the max heap-size is retained.
+  
+* *Skew*  - The records consumed and emitted to/from a given
+  task are assumed to follow observed averages, i.e. records will be
+  more regular than may be seen in the wild. Each map also generates
+  a proportional percentage of data for each reduce, so a job with
+  unbalanced input will be flattened.
+  
+* *Job Failure*  - User code is assumed to be correct.
+  
+* *Job Independence*  - The output or outcome of one job does
+  not affect when or whether a subsequent job will run.
+
+
+Appendix
+--------
+
+Issues tracking the original implementations of
+<a href="https://issues.apache.org/jira/browse/HADOOP-2369">GridMix1</a>,
+<a href="https://issues.apache.org/jira/browse/HADOOP-3770">GridMix2</a>,
+and <a href="https://issues.apache.org/jira/browse/MAPREDUCE-776">GridMix3</a>
+can be found on the Apache Hadoop MapReduce JIRA. Other issues tracking
+the current development of GridMix can be found by searching
+<a href="https://issues.apache.org/jira/browse/MAPREDUCE/component/12313086">
+the Apache Hadoop MapReduce JIRA</a>.

+ 397 - 0
hadoop-tools/hadoop-rumen/src/site/markdown/Rumen.md.vm

@@ -0,0 +1,397 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+#set ( $H3 = '###' )
+#set ( $H4 = '####' )
+#set ( $H5 = '#####' )
+
+Rumen
+=====
+
+---
+
+- [Overview](#Overview)
+    - [Motivation](#Motivation)
+    - [Components](#Components)
+- [How to use Rumen?](#How_to_use_Rumen)
+    - [Trace Builder](#Trace_Builder)
+        - [Example](#Example)
+    - [Folder](#Folder)
+        - [Examples](#Examples)
+- [Appendix](#Appendix)
+    - [Resources](#Resources)
+    - [Dependencies](#Dependencies)
+
+---
+
+Overview
+--------
+    
+*Rumen* is a data extraction and analysis tool built for
+*Apache Hadoop*. *Rumen* mines *JobHistory* logs to 
+extract meaningful data and stores it in an easily-parsed, condensed 
+format or *digest*. The raw trace data from MapReduce logs are 
+often insufficient for simulation, emulation, and benchmarking, as these 
+tools often attempt to measure conditions that did not occur in the 
+source data. For example, if a task ran locally in the raw trace data 
+but a simulation of the scheduler elects to run that task on a remote 
+rack, the simulator requires a runtime its input cannot provide. 
+To fill in these gaps, Rumen performs a statistical analysis of the 
+digest to estimate the variables the trace doesn't supply. Rumen traces 
+drive both Gridmix (a benchmark of Hadoop MapReduce clusters) and Mumak 
+(a simulator for the JobTracker).
+
+
+$H3 Motivation
+     
+* Extracting meaningful data from *JobHistory* logs is a common
+  task for any tool built to work on *MapReduce*. It
+  is tedious to write a custom tool that is so tightly coupled with
+  the *MapReduce* framework. Hence there is a need for a
+  built-in tool that performs the framework-level task of log parsing and
+  analysis. Such a tool would insulate external systems that depend on
+  job history against changes made to the job history format.
+
+* Performing statistical analysis of various attributes of a 
+  *MapReduce Job* such as *task runtimes, task failures 
+  etc* is another common task that the benchmarking 
+  and simulation tools might need. *Rumen* generates 
+  <a href="http://en.wikipedia.org/wiki/Cumulative_distribution_function">
+  *Cumulative Distribution Functions (CDF)*
+  </a> for the Map/Reduce task runtimes. 
+  The runtime CDF can be used to extrapolate the task runtime of
+  incomplete, missing and synthetic tasks. Similarly, a CDF is also
+  computed for the total number of successful tasks for every attempt.
+
+
+$H3 Components
+
+*Rumen* consists of two components:
+
+* *Trace Builder* : 
+  Converts *JobHistory* logs into an easily-parsed format.
+  Currently `TraceBuilder` outputs the trace in 
+  <a href="http://www.json.org/">*JSON*</a> 
+  format.   
+
+* *Folder* :
+  A utility to scale the input trace. A trace obtained from
+  *TraceBuilder* simply summarizes the jobs in the 
+  input folders and files. The time-span within which all the jobs in 
+  a given trace finish can be considered as the trace runtime. 
+  *Folder* can be used to scale the runtime of a trace.
+  Decreasing the trace runtime might involve dropping some jobs from 
+  the input trace and scaling down the runtime of remaining jobs. 
+  Increasing the trace runtime might involve adding some dummy jobs to
+  the resulting trace and scaling up the runtime of individual jobs.
+
+
+How to use Rumen?
+-----------------
+
+Converting *JobHistory* logs into a desired job-trace consists of two steps:
+
+1.  Extracting information into an intermediate format
+
+2.  Adjusting the job-trace obtained from the intermediate trace to 
+    have the desired properties.
+       
+> Extracting information from *JobHistory* logs is a one-time
+> operation. This so-called *Gold Trace* can be reused to
+> generate traces with desired values of properties such as
+> `output-duration`, `concentration` etc.
+       
+*Rumen* provides two basic commands:
+
+* `TraceBuilder`
+* `Folder`
+       
+First, we need to generate the *Gold Trace* by running
+`TraceBuilder` on a job-history folder.
+The output of `TraceBuilder` is a job-trace file (and an
+optional cluster-topology file). If we want to scale the output, we
+can use the `Folder` utility to fold the current trace to the
+desired length. The remaining part of this section explains these
+utilities in detail.
+    
+> Examples in this section assume that certain libraries are present
+> in the Java CLASSPATH. See the [Dependencies](#Dependencies) section for more details.
+
+
+$H3 Trace Builder
+      
+`Command:`
+
+    java org.apache.hadoop.tools.rumen.TraceBuilder [options] <jobtrace-output> <topology-output> <inputs>
+  
+This command invokes the `TraceBuilder` utility of
+*Rumen*. It converts the JobHistory files into a series of JSON
+objects and writes them into the `<jobtrace-output>`
+file. It also extracts the cluster layout (topology) and writes it to
+the `<topology-output>` file.
+`<inputs>` represents a space-separated list of
+JobHistory files and folders.
+         
+> 1) Input and output to `TraceBuilder` are expected to
+> be fully qualified FileSystem paths. So use file://
+> to specify files on the `local` FileSystem and
+> hdfs:// to specify files on HDFS. Since the input files or
+> folders are FileSystem paths, they can be globbed.
+> This can be useful for specifying multiple file paths using
+> glob patterns.
+
+> 2) By default, TraceBuilder does not recursively scan the input
+> folder for job history files. Only the files that are directly
+> placed under the input folder will be considered for generating
+> the trace. To add all the files under the input directory by
+> recursively scanning it, use the `-recursive`
+> option.
+
+Cluster topology is used as follows:
+
+* To reconstruct the splits and make sure that the 
+  distances/latencies seen in the actual run are modeled correctly.
+
+* To extrapolate splits information for tasks with missing splits
+  details or synthetically generated tasks.
+      
+`Options:`
+
+<table>
+  <tr>
+    <th> Parameter</th>
+    <th> Description</th>
+    <th> Notes </th>
+  </tr>
+  <tr>
+    <td><code>-demuxer</code></td>
+    <td>Used to read the jobhistory files. The default is 
+        <code>DefaultInputDemuxer</code>.</td>
+    <td>Demuxer decides how the input file maps to jobhistory file(s). 
+        Job history logs and job configuration files are typically small 
+        files, and can be more effectively stored when embedded in some
+        container file format like SequenceFile or TFile. To support such 
+        usage cases, one can specify a customized Demuxer class that can 
+        extract individual job history logs and job configuration files 
+        from the source files.
+    </td>
+  </tr>
+  <tr>
+     <td><code>-recursive</code></td>
+     <td>Recursively traverse input paths for job history logs.</td>
+     <td>This option should be used to inform the TraceBuilder to
+     recursively scan the input paths and process all the files under it.
+     Note that, by default, only the history logs that are directly under
+     the input folder are considered for generating the trace.
+     </td>
+  </tr>
+</table>
+      
+
+$H4 Example
+
+    java org.apache.hadoop.tools.rumen.TraceBuilder file:///home/user/job-trace.json file:///home/user/topology.output file:///home/user/logs/history/done
+
+This will analyze all the jobs in
+`/home/user/logs/history/done` stored on the
+`local` FileSystem and output the job traces in
+`/home/user/job-trace.json` along with topology
+information in `/home/user/topology.output`.
+
+
+$H3 Folder
+      
+`Command`:
+
+    java org.apache.hadoop.tools.rumen.Folder [options] [input] [output]
+      
+> Input and output to `Folder` is expected to be a fully 
+> qualified FileSystem path. So use file:// to specify 
+> files on the `local` FileSystem and hdfs:// to
+> specify files on HDFS.
+      
+This command invokes the `Folder` utility of 
+*Rumen*. Folding essentially means that the output duration of 
+the resulting trace is fixed and job timelines are adjusted 
+to respect the final output duration. 
+
+`Options:`
+
+<table>
+  <tr>
+    <th> Parameter</th>
+    <th> Description</th>
+    <th> Notes </th>
+  </tr>
+  <tr>
+    <td><code>-input-cycle</code></td>
+    <td>Defines the basic unit of time for the folding operation. There is
+        no default value for <code>input-cycle</code>. 
+        <strong>Input cycle must be provided</strong>.
+    </td>
+    <td>'<code>-input-cycle 10m</code>'
+        implies that the whole trace run will now be sliced at a 10-minute
+        interval. Basic operations will be done on these 10-minute chunks.
+        Note that <em>Rumen</em> understands various time units like
+        <em>m (minutes), h (hours), d (days) etc</em>.
+    </td>
+  </tr>
+  <tr>
+    <td><code>-output-duration</code></td>
+    <td>This parameter defines the final runtime of the trace.
+        The default value is <strong>1 hour</strong>.
+    </td>
+    <td>'<code>-output-duration 30m</code>' 
+        implies that the resulting trace will have a max runtime of 
+        30mins. All the jobs in the input trace file will be folded and 
+        scaled to fit this window.
+    </td>
+  </tr>
+  <tr>
+    <td><code>-concentration</code></td>
+    <td>Set the concentration of the resulting trace. Default value is 
+        <strong>1</strong>.
+    </td>
+    <td>If the total runtime of the resulting trace is less than the total
+        runtime of the input trace, then the resulting trace would contain
+        fewer jobs than the input trace. This
+        essentially means that the output is diluted. To increase the
+        density of jobs, set the concentration to a higher value.</td>
+  </tr>
+  <tr>
+    <td><code>-debug</code></td>
+    <td>Run the Folder in debug mode. By default it is set to 
+        <strong>false</strong>.</td>
+    <td>In debug mode, the Folder will print additional statements for 
+        debugging. Also the intermediate files generated in the scratch 
+        directory will not be cleaned up.
+    </td>
+  </tr>
+  <tr>
+    <td><code>-seed</code></td>
+    <td>Initial seed to the Random Number Generator. By default, a Random 
+        Number Generator is used to generate a seed and the seed value is
+        reported back to the user for future use.
+    </td>
+    <td>If an initial seed is passed, then the <code>Random Number
+        Generator</code> will generate the random numbers in the same
+        sequence, i.e. the sequence of random numbers remains the same if
+        the same seed is used. Folder uses the random number generator to
+        decide whether or not to emit a job.
+    </td>
+  </tr>
+  <tr>
+    <td><code>-temp-directory</code></td>
+    <td>Temporary directory for the Folder. By default the <strong>output
+        folder's parent directory</strong> is used as the scratch space.
+    </td>
+    <td>This is the scratch space used by Folder.  All the 
+        temporary files are cleaned up in the end unless the Folder is run
+        in <code>debug</code> mode.</td>
+  </tr>
+  <tr>
+    <td><code>-skew-buffer-length</code></td>
+    <td>Enables <em>Folder</em> to tolerate skewed jobs.
+        The default buffer length is <strong>0</strong>.</td>
+    <td>'<code>-skew-buffer-length 100</code>' 
+        indicates that if the jobs appear out of order within a window 
+        size of 100, then they will be emitted in-order by the folder. 
+        If a job appears out-of-order outside this window, then the Folder
+        will bail out provided <code>-allow-missorting</code> is not set.
+        <em>Folder</em> reports the maximum skew size seen in the 
+        input trace for future use.
+    </td>
+  </tr>
+  <tr>
+    <td><code>-allow-missorting</code></td>
+    <td>Enables <em>Folder</em> to tolerate out-of-order jobs. By default 
+        mis-sorting is not allowed.
+    </td>
+    <td>If mis-sorting is allowed, then the <em>Folder</em> will ignore 
+        out-of-order jobs that cannot be deskewed using a skew buffer of
+        size specified using <code>-skew-buffer-length</code>. If 
+        mis-sorting is not allowed, then the Folder will bail out if the
+        skew buffer is incapable of tolerating the skew.
+    </td>
+  </tr>
+</table>
+
+
+$H4 Examples
+
+$H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime
+               
+    java org.apache.hadoop.tools.rumen.Folder -output-duration 1h  -input-cycle 20m  file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json
+
+If the folded jobs are out of order then the command will bail out. 
+
+$H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime and tolerate some skewness
+
+    java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m  -allow-missorting -skew-buffer-length 100 file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json
+
+If the folded jobs are out of order, then at most
+100 jobs will be de-skewed. If the 101<sup>st</sup> job is 
+*out-of-order*, then the command will bail out.
+
+$H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime in debug mode
+
+    java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m  -debug -temp-directory file:///tmp/debug file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json
+
+This will fold the 10hr job-trace file 
+`file:///home/user/job-trace.json` to finish within 1hr 
+and use `file:///tmp/debug` as the temporary directory. 
+The intermediate files in the temporary directory will not be cleaned
+up.
+      
+$H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime with custom concentration
+
+    java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m  -concentration 2  file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json
+
+This will fold the 10hr job-trace file
+`file:///home/user/job-trace.json` to finish within 1hr
+with a concentration of 2. Folding a 10-hour trace into 1 hour (as in the
+first example above) retains 10% of the jobs; with *concentration* set
+to 2, 20% of the total input jobs will be retained.
+
+
+Appendix
+--------
+    
+$H3 Resources
+
+<a href="https://issues.apache.org/jira/browse/MAPREDUCE-751">MAPREDUCE-751</a>
+is the main JIRA that introduced *Rumen* to *MapReduce*. 
+Look at the MapReduce
+<a href="https://issues.apache.org/jira/browse/MAPREDUCE/component/12313617">
+rumen-component</a> for further details.
+
+
+$H3 Dependencies
+
+*Rumen* expects certain library *JARs* to be present in 
+the *CLASSPATH*. The required libraries are:
+
+* `Hadoop MapReduce Tools` (`hadoop-mapred-tools-{hadoop-version}.jar`)
+* `Hadoop Common` (`hadoop-common-{hadoop-version}.jar`)
+* `Apache Commons Logging` (`commons-logging-1.1.1.jar`)
+* `Apache Commons CLI` (`commons-cli-1.2.jar`)
+* `Jackson Mapper` (`jackson-mapper-asl-1.4.2.jar`)
+* `Jackson Core` (`jackson-core-asl-1.4.2.jar`)
+
+> One simple way to run Rumen is to invoke it via
+> '$HADOOP_HOME/bin/hadoop jar', which sets up the Hadoop CLASSPATH for you.
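+
+A sketch of such an invocation (the Rumen JAR placeholder and the paths are
+illustrative; the exact JAR name depends on the release):
+
+    # the hadoop wrapper sets up the Hadoop portion of the CLASSPATH
+    hadoop jar <rumen-jar> org.apache.hadoop.tools.rumen.TraceBuilder \
+      file:///home/user/job-trace.json file:///home/user/topology.output \
+      file:///home/user/logs/history/done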