
HADOOP-10954. Adding site documents of hadoop-tools (Masatake Iwasaki via aw)

Allen Wittenauer 10 years ago
parent
commit
27f9edceed

+ 3 - 0
hadoop-common-project/hadoop-common/CHANGES.txt

@@ -212,6 +212,9 @@ Release 2.6.0 - UNRELEASED
     HADOOP-8808. Update FsShell documentation to mention deprecation of some 
     of the commands, and mention alternatives (Akira AJISAKA via aw)
 
+    HADOOP-10954. Adding site documents of hadoop-tools (Masatake Iwasaki 
+    via aw)
+
   OPTIMIZATIONS
 
     HADOOP-10838. Byte array native checksumming. (James Thomas via todd)

+ 818 - 0
hadoop-tools/hadoop-gridmix/src/site/markdown/GridMix.md.vm

@@ -0,0 +1,818 @@
+<!---
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+Gridmix
+=======
+
+---
+
+- [Overview](#Overview)
+- [Usage](#Usage)
+- [General Configuration Parameters](#General_Configuration_Parameters)
+- [Job Types](#Job_Types)
+- [Job Submission Policies](#Job_Submission_Policies)
+- [Emulating Users and Queues](#Emulating_Users_and_Queues)
+- [Emulating Distributed Cache Load](#Emulating_Distributed_Cache_Load)
+- [Configuration of Simulated Jobs](#Configuration_of_Simulated_Jobs)
+- [Emulating Compression/Decompression](#Emulating_CompressionDecompression)
+- [Emulating High-Ram jobs](#Emulating_High-Ram_jobs)
+- [Emulating resource usages](#Emulating_resource_usages)
+- [Simplifying Assumptions](#Simplifying_Assumptions)
+- [Appendix](#Appendix)
+
+---
+
+Overview
+--------
+
+GridMix is a benchmark for Hadoop clusters. It submits a mix of
+synthetic jobs, modeling a profile mined from production loads.
+
+There are three versions of the GridMix tool. This document
+discusses the third (checked into `src/contrib`), distinct
+from the two checked into the `src/benchmarks` sub-directory.
+While the first two versions of the tool included stripped-down versions
+of common jobs, both were principally saturation tools for stressing the
+framework at scale. In support of a broader range of deployments and
+finer-tuned job mixes, this version of the tool will attempt to model
+the resource profiles of production jobs to identify bottlenecks, guide
+development, and serve as a replacement for the existing GridMix
+benchmarks.
+
+To run GridMix, you need a MapReduce job trace describing the job mix
+for a given cluster. Such traces are typically generated by Rumen (see
+Rumen documentation). GridMix also requires input data from which the
+synthetic jobs will be reading bytes. The input data need not be in any
+particular format, as the synthetic jobs are currently binary readers.
+If you are running on a new cluster, an optional step generating input
+data may precede the run.
+In order to emulate the load of production jobs from a given cluster
+on the same or another cluster, follow these steps:
+
+1.  Locate the job history files on the production cluster. This
+    location is specified by the
+    `mapred.job.tracker.history.completed.location`
+    configuration property of the cluster.
+
+2.  Run Rumen to build a job trace in JSON format for all or select jobs.
+
+3.  Use GridMix with the job trace on the benchmark cluster.
+
+Jobs submitted by GridMix have names of the form
+"`GRIDMIXnnnnnn`", where
+"`nnnnnn`" is a sequence number padded with leading zeroes.
+
+
+Usage
+-----
+
+Basic command-line usage without configuration parameters:
+
+    org.apache.hadoop.mapred.gridmix.Gridmix [-generate <size>] [-users <users-list>] <iopath> <trace>
+
+Basic command-line usage with configuration parameters:
+
+    org.apache.hadoop.mapred.gridmix.Gridmix \
+      -Dgridmix.client.submit.threads=10 -Dgridmix.output.directory=foo \
+      [-generate <size>] [-users <users-list>] <iopath> <trace>
+
+> Configuration parameters like
+> `-Dgridmix.client.submit.threads=10` and
+> `-Dgridmix.output.directory=foo` as given above should
+> be used  *before*  other GridMix parameters.
+
+The `<iopath>` parameter is the working directory for
+GridMix. Note that this can either be on the local file-system
+or on HDFS, but it is highly recommended that it be the same as that for
+the original job mix so that GridMix puts the same load on the local
+file-system and HDFS respectively.
+
+The `-generate` option is used to generate input data and
+Distributed Cache files for the synthetic jobs. It accepts standard units
+of size suffixes, e.g. `100g` will generate
+100 * 2<sup>30</sup> bytes as input data.
+`<iopath>/input` is the destination directory for
+generated input data and/or the directory from which input data will be
+read. HDFS-based Distributed Cache files are generated under the
+distributed cache directory `<iopath>/distributedCache`.
+If some of the needed Distributed Cache files already exist in the
+distributed cache directory, then only the remaining missing
+Distributed Cache files are generated when the `-generate` option
+is specified.
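+
+For example, a data-generation run might look like the following (the
+`<gridmix-jar>` placeholder is the GridMix JAR described below; the iopath
+and trace paths are illustrative):
+
+    # generates 100 * 2^30 bytes of input data under <iopath>/input and the
+    # needed Distributed Cache files under <iopath>/distributedCache
+    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
+      -generate 100g /user/gridmix /user/gridmix/trace.json.gz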
+
+The `-users` option is used to point to a users-list
+file (see <a href="#usersqueues">Emulating Users and Queues</a>).
+
+The `<trace>` parameter is a path to a job trace
+generated by Rumen. This trace can be compressed (it must be readable
+using one of the compression codecs supported by the cluster) or
+uncompressed. Use "-" as the value of this parameter if you
+want to pass an  *uncompressed*  trace via the standard
+input-stream of GridMix.
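+
+For instance, an uncompressed trace can be streamed to GridMix like this
+(the local trace filename is illustrative):
+
+    # decompress a locally stored trace on the fly and pass it via "-"
+    gunzip -c job-trace.json.gz | \
+      hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
+      <iopath> -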
+
+The class `org.apache.hadoop.mapred.gridmix.Gridmix` can
+be found in the JAR
+`contrib/gridmix/hadoop-gridmix-$VERSION.jar` inside your
+Hadoop installation, where `$VERSION` corresponds to the
+version of Hadoop installed. A simple way of ensuring that this class
+and all its dependencies are loaded correctly is to use the
+`hadoop` wrapper script in Hadoop:
+
+    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
+      [-generate <size>] [-users <users-list>] <iopath> <trace>
+
+The supported configuration parameters are explained in the
+following sections.
+
+
+General Configuration Parameters
+--------------------------------
+
+<table>
+  <tr>
+    <th>Parameter</th>
+    <th>Description</th>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.output.directory</code>
+    </td>
+    <td>The directory into which output will be written. If specified,
+    <code>iopath</code> will be relative to this parameter. The
+    submitting user must have read/write access to this directory. The
+    user should also be mindful of any quota issues that may arise
+    during a run. The default is "<code>gridmix</code>".</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.client.submit.threads</code>
+    </td>
+    <td>The number of threads submitting jobs to the cluster. This
+    also controls how many splits will be loaded into memory at a given
+    time, pending the submit time in the trace. Splits are pre-generated
+    to hit submission deadlines, so particularly dense traces may want
+    more submitting threads. However, storing splits in memory is
+    reasonably expensive, so you should raise this cautiously. The
+    default is 1 for the SERIAL job-submission policy (see
+    <a href="#policies">Job Submission Policies</a>) and one more than
+    the number of processors on the client machine for the other
+    policies.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.submit.multiplier</code>
+    </td>
+    <td>The multiplier to accelerate or decelerate the submission of
+    jobs. The time separating two jobs is multiplied by this factor.
+    The default value is 1.0. This is a crude mechanism to size
+    a job trace to a cluster.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.client.pending.queue.depth</code>
+    </td>
+    <td>The depth of the queue of job descriptions awaiting split
+    generation. The jobs read from the trace occupy a queue of this
+    depth before being processed by the submission threads. It is
+    unusual to configure this. The default is 5.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.gen.blocksize</code>
+    </td>
+    <td>The block-size of generated data. The default value is 256
+    MiB.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.gen.bytes.per.file</code>
+    </td>
+    <td>The maximum bytes written per file. The default value is 1
+    GiB.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.min.file.size</code>
+    </td>
+    <td>The minimum size of the input files. The default limit is 128
+    MiB. Tweak this parameter if you see an error-message like
+    "Found no satisfactory file" while testing GridMix with
+    a relatively-small input data-set.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.max.total.scan</code>
+    </td>
+    <td>The maximum size of the input files. The default limit is 100
+    TiB.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.task.jvm-options.enable</code>
+    </td>
+    <td>Enables Gridmix to configure the simulated task's max heap 
+        options using the values obtained from the original task (i.e via
+        trace).
+    </td>
+  </tr>
+</table>
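+
+As noted in the Usage section, these parameters are passed as `-D` options
+placed before the other GridMix arguments. For example (the values below are
+chosen purely for illustration):
+
+    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
+      -Dgridmix.output.directory=gridmix-run-1 \
+      -Dgridmix.client.submit.threads=8 \
+      -Dgridmix.submit.multiplier=0.5 \
+      [-generate <size>] [-users <users-list>] <iopath> <trace>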
+
+
+Job Types
+---------
+
+GridMix takes as input a job trace, essentially a stream of
+JSON-encoded job descriptions. For each job description, the submission
+client obtains the original job submission time and for each task in
+that job, the byte and record counts read and written. Given this data,
+it constructs a synthetic job with the same byte and record patterns as
+recorded in the trace. It constructs jobs of two types:
+
+<table>
+  <tr>
+    <th>Job Type</th>
+    <th>Description</th>
+  </tr>
+  <tr>
+    <td>
+      <code>LOADJOB</code>
+    </td>
+    <td>A synthetic job that emulates the workload recorded in the Rumen
+    trace. The current version emulates only I/O. It reproduces
+    the I/O workload on the benchmark cluster. It does so by embedding
+    the detailed I/O information for every map and reduce task, such as
+    the number of bytes and records read and written, into each
+    job's input splits. The map tasks further relay the I/O patterns of
+    reduce tasks through the intermediate map output data.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>SLEEPJOB</code>
+    </td>
+    <td>A synthetic job where each task does  *nothing*  but sleep
+    for a certain duration as observed in the production trace. The
+    scalability of the Job Tracker is often limited by how many
+    heartbeats it can handle every second. (Heartbeats are periodic
+    messages sent from Task Trackers to update their status and grab new
+    tasks from the Job Tracker.) Since a benchmark cluster is typically
+    a fraction of the size of a production cluster, the heartbeat traffic
+    generated by the slave nodes is well below the level of the
+    production cluster. One possible solution is to run multiple Task
+    Trackers on each slave node. This leads to the obvious problem that
+    the I/O workload generated by the synthetic jobs would thrash the
+    slave nodes. Hence the need for such a job.</td>
+  </tr>
+</table>
+
+The following configuration parameters affect the job type:
+
+<table>
+  <tr>
+    <th>Parameter</th>
+    <th>Description</th>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.job.type</code>
+    </td>
+    <td>The value for this key can be one of LOADJOB or SLEEPJOB. The
+    default value is LOADJOB.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.key.fraction</code>
+    </td>
+    <td>For a LOADJOB type of job, the fraction of a record used for
+    the data for the key. The default value is 0.1.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.sleep.maptask-only</code>
+    </td>
+    <td>For a SLEEPJOB type of job, whether to ignore the reduce
+    tasks for the job. The default is <code>false</code>.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.sleep.fake-locations</code>
+    </td>
+    <td>For a SLEEPJOB type of job, the number of fake locations
+    for map tasks for the job. The default is 0.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.sleep.max-map-time</code>
+    </td>
+    <td>For a SLEEPJOB type of job, the maximum runtime for map
+    tasks for the job in milliseconds. The default is unlimited.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.sleep.max-reduce-time</code>
+    </td>
+    <td>For a SLEEPJOB type of job, the maximum runtime for reduce
+    tasks for the job in milliseconds. The default is unlimited.</td>
+  </tr>
+</table>
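+
+For example, a sleep-job run with map-only tasks capped at one minute might
+be configured as follows (the values are illustrative only):
+
+    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
+      -Dgridmix.job.type=SLEEPJOB \
+      -Dgridmix.sleep.maptask-only=true \
+      -Dgridmix.sleep.max-map-time=60000 \
+      <iopath> <trace>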
+
+
+<a name="policies"></a>
+
+Job Submission Policies
+-----------------------
+
+GridMix controls the rate of job submission. This control can be
+based on the trace information or can be based on statistics it gathers
+from the Job Tracker. Based on the submission policies users define,
+GridMix uses the respective algorithm to control the job submission.
+There are currently three types of policies:
+
+<table>
+  <tr>
+    <th>Job Submission Policy</th>
+    <th>Description</th>
+  </tr>
+  <tr>
+    <td>
+      <code>STRESS</code>
+    </td>
+    <td>Keep submitting jobs so that the cluster remains under stress.
+    In this mode we control the rate of job submission by monitoring
+    the real-time load of the cluster so that we can maintain a stable
+    stress level of workload on the cluster. Based on the statistics we
+    gather we define if a cluster is  *underloaded*  or
+     *overloaded* . We consider a cluster  *underloaded*  if
+    and only if the following three conditions are true:
+    <ol>
+      <li>the number of pending and running jobs are under a threshold
+      TJ</li>
+      <li>the number of pending and running maps are under threshold
+      TM</li>
+      <li>the number of pending and running reduces are under threshold
+      TR</li>
+    </ol>
+    The thresholds TJ, TM and TR are proportional to the size of the
+    cluster and map, reduce slots capacities respectively. In case of a
+    cluster being  *overloaded* , we throttle the job submission.
+    In the actual calculation we also weigh each running task with its
+    remaining work - namely, a 90% complete task is only counted as 0.1
+    in calculation. Finally, to avoid a very large job blocking other
+    jobs, we limit the number of pending/waiting tasks each job can
+    contribute.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>REPLAY</code>
+    </td>
+    <td>In this mode we replay the job traces faithfully. This mode
+    exactly follows the time-intervals given in the actual job
+    trace.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>SERIAL</code>
+    </td>
+    <td>In this mode we submit the next job only once the job submitted
+    earlier is completed.</td>
+  </tr>
+</table>
+
+The following configuration parameters affect the job submission policy:
+
+<table>
+  <tr>
+    <th>Parameter</th>
+    <th>Description</th>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.job-submission.policy</code>
+    </td>
+    <td>The value for this key can be one of STRESS, REPLAY
+    or SERIAL. In most cases the value would be STRESS or
+    REPLAY. The default value is STRESS.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.throttle.jobs-to-tracker-ratio</code>
+    </td>
+    <td>In STRESS mode, the minimum ratio of running jobs to Task
+    Trackers in a cluster for the cluster to be considered
+     *overloaded* . This is the threshold TJ referred to earlier.
+    The default is 1.0.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.throttle.maps.task-to-slot-ratio</code>
+    </td>
+    <td>In STRESS mode, the minimum ratio of pending and running map
+    tasks (i.e. incomplete map tasks) to the number of map slots for
+    a cluster for the cluster to be considered  *overloaded* .
+    This is the threshold TM referred to earlier. Running map tasks are
+    counted partially. For example, a 40% complete map task is counted
+    as 0.6 map tasks. The default is 2.0.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.throttle.reduces.task-to-slot-ratio</code>
+    </td>
+    <td>In STRESS mode, the minimum ratio of pending and running reduce
+    tasks (i.e. incomplete reduce tasks) to the number of reduce slots
+    for a cluster for the cluster to be considered  *overloaded* .
+    This is the threshold TR referred to earlier. Running reduce tasks
+    are counted partially. For example, a 30% complete reduce task is
+    counted as 0.7 reduce tasks. The default is 2.5.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.throttle.maps.max-slot-share-per-job</code>
+    </td>
+    <td>In STRESS mode, the maximum share of a cluster's map-slots
+    capacity that can be counted toward a job's incomplete map tasks in
+    overload calculation. The default is 0.1.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.throttle.reducess.max-slot-share-per-job</code>
+    </td>
+    <td>In STRESS mode, the maximum share of a cluster's reduce-slots
+    capacity that can be counted toward a job's incomplete reduce tasks
+    in overload calculation. The default is 0.1.</td>
+  </tr>
+</table>
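+
+For instance, a REPLAY run, and a STRESS run with a tighter map-slot
+threshold, might look as follows (the ratio value is illustrative only):
+
+    # replay the trace with its original inter-job time intervals
+    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
+      -Dgridmix.job-submission.policy=REPLAY <iopath> <trace>
+
+    # STRESS run that considers the cluster overloaded sooner
+    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
+      -Dgridmix.job-submission.policy=STRESS \
+      -Dgridmix.throttle.maps.task-to-slot-ratio=1.5 <iopath> <trace>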
+
+
+<a name="usersqueues"></a>
+
+Emulating Users and Queues
+--------------------------
+
+Typical production clusters are often shared with different users and
+the cluster capacity is divided among different departments through job
+queues. Ensuring fairness among jobs from all users, honoring queue
+capacity allocation policies and avoiding an ill-behaving job from
+taking over the cluster adds significant complexity in Hadoop software.
+To be able to sufficiently test and discover bugs in these areas,
+GridMix must emulate the contentions of jobs from different users and/or
+submitted to different queues.
+
+Emulating multiple queues is easy - we simply set up the benchmark
+cluster with the same queue configuration as the production cluster and
+we configure synthetic jobs so that they get submitted to the same queue
+as recorded in the trace. However, not all users shown in the trace have
+accounts on the benchmark cluster. Instead, we set up a number of testing
+user accounts and associate each unique user in the trace to testing
+users in a round-robin fashion.
+
+The following configuration parameters affect the emulation of users
+and queues:
+
+<table>
+  <tr>
+    <th>Parameter</th>
+    <th>Description</th>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.job-submission.use-queue-in-trace</code>
+    </td>
+    <td>When set to <code>true</code> it uses exactly the same set of
+    queues as those mentioned in the trace. The default value is
+    <code>false</code>.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.job-submission.default-queue</code>
+    </td>
+    <td>Specifies the default queue to which all the jobs would be
+    submitted. If this parameter is not specified, GridMix uses the
+    default queue defined for the submitting user on the cluster.</td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.user.resolve.class</code>
+    </td>
+    <td>Specifies which <code>UserResolver</code> implementation to use.
+    We currently have three implementations:
+    <ol>
+      <li><code>org.apache.hadoop.mapred.gridmix.EchoUserResolver</code>
+      - submits a job as the user who submitted the original job. All
+      the users of the production cluster identified in the job trace
+      must also have accounts on the benchmark cluster in this case.</li>
+      <li><code>org.apache.hadoop.mapred.gridmix.SubmitterUserResolver</code>
+      - submits all the jobs as the current GridMix user. In this case we
+      simply map all the users in the trace to the current GridMix user
+      and submit the job.</li>
+      <li><code>org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver</code>
+      - maps trace users to test users in a round-robin fashion. In
+      this case we set up a number of testing user accounts and
+      associate each unique user in the trace to testing users in a
+      round-robin fashion.</li>
+    </ol>
+    The default is
+    <code>org.apache.hadoop.mapred.gridmix.SubmitterUserResolver</code>.</td>
+  </tr>
+</table>
+
+If the parameter `gridmix.user.resolve.class` is set to
+`org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver`,
+we need to define a users-list file with a list of test users.
+This is specified using the `-users` option to GridMix.
+
+> Specifying a users-list file using the `-users` option is
+> mandatory when using the round-robin user-resolver. Other user-resolvers
+> ignore this option.
+
+A users-list file has one user per line, each line of the format:
+
+    <username>
+
+For example:
+
+    user1
+    user2
+    user3
+
+In the above example we have defined three users `user1`, `user2` and `user3`.
+Each unique user in the trace is then associated with the above users
+in round-robin fashion. For example, if the trace's users are
+`tuser1`, `tuser2`, `tuser3`, `tuser4` and `tuser5`, then the mappings would be:
+
+    tuser1 -> user1
+    tuser2 -> user2
+    tuser3 -> user3
+    tuser4 -> user1
+    tuser5 -> user2
+
+For backward compatibility reasons, each line of the users-list file can
+contain a username followed by group names, in the form `username[,group]*`.
+The group names are ignored by GridMix.
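+
+A minimal sketch of a run using the round-robin resolver (the users-list
+path is illustrative):
+
+    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
+      -Dgridmix.user.resolve.class=org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver \
+      -users file:///home/user/users-list.txt <iopath> <trace>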
+
+
+Emulating Distributed Cache Load
+--------------------------------
+
+Gridmix emulates Distributed Cache load by default for LOADJOB type of
+jobs. This is done by precreating the needed Distributed Cache files for all
+the simulated jobs as part of a separate MapReduce job.
+
+Emulation of Distributed Cache load in GridMix simulated jobs can be
+disabled by setting the property
+`gridmix.distributed-cache-emulation.enable` to
+`false`.
+Generation of Distributed Cache data by GridMix, however, is driven by the
+`-generate` option and is independent of this configuration
+property.
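+
+For example, a run that keeps input-data generation but turns off
+Distributed Cache load emulation could look like this:
+
+    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
+      -Dgridmix.distributed-cache-emulation.enable=false \
+      -generate 100g <iopath> <trace>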
+
+Both generation of Distributed Cache files and emulation of
+Distributed Cache load are disabled if:
+
+* the input trace comes from the standard input-stream instead of a file, or
+* the specified `<iopath>` is on the local file-system, or
+* any of the ancestor directories of the distributed cache directory
+  `<iopath>/distributedCache` (including the distributed
+  cache directory itself) doesn't have execute permission for others.
+
+
+Configuration of Simulated Jobs
+-------------------------------
+
+Gridmix3 sets some configuration properties in the simulated Jobs
+submitted by it so that they can be mapped back to the corresponding Job
+in the input Job trace. These configuration parameters include:
+
+<table>
+  <tr>
+    <th>Parameter</th>
+    <th>Description</th>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.job.original-job-id</code>
+    </td>
+    <td> The job id of the original cluster's job corresponding to this
+    simulated job.
+    </td>
+  </tr>
+  <tr>
+    <td>
+      <code>gridmix.job.original-job-name</code>
+    </td>
+    <td> The job name of the original cluster's job corresponding to this
+    simulated job.
+    </td>
+  </tr>
+</table>
+
+
+Emulating Compression/Decompression
+-----------------------------------
+
+MapReduce supports data compression and decompression.
+Input to a MapReduce job can be compressed. Similarly, output of Map
+and Reduce tasks can also be compressed. Compression/Decompression
+emulation in GridMix is important because emulating
+compression/decompression affects the CPU and memory usage of the
+task. A task emulating compression/decompression will affect other
+tasks and daemons running on the same node.
+
+Compression emulation is enabled if 
+`gridmix.compression-emulation.enable` is set to
+`true`. By default compression emulation is enabled for 
+jobs of type  *LOADJOB* . With compression emulation enabled,
+GridMix generates compressible text data with a constant
+compression ratio. Hence a simulated GridMix job emulates
+compression/decompression using compressible text data (having a
+constant compression ratio), irrespective of the compression ratio
+observed in the actual job.
+
+A typical MapReduce Job deals with data compression/decompression in
+the following phases:
+
+* `Job input data decompression: ` GridMix generates 
+  compressible input data when compression emulation is enabled. 
+  Based on the original job's configuration, a simulated GridMix job 
+  will use a decompressor to read the compressed input data. 
+  Currently, GridMix uses
+  `mapreduce.input.fileinputformat.inputdir` to determine 
+  if the original job used compressed input data or 
+  not. If the original job's input files are uncompressed then the 
+  simulated job will read the compressed input file without using a 
+  decompressor. 
+
+* `Intermediate data compression and decompression: `
+  If the original job has map output compression enabled then GridMix 
+  too will enable map output compression for the simulated job. 
+  Accordingly, the reducers will use a decompressor to read the map 
+  output data.
+
+* `Job output data compression: `
+  If the original job's output is compressed then GridMix 
+  too will enable job output compression for the simulated job. 
+       
+The following configuration parameters affect compression emulation:
+
+<table>
+  <tr>
+    <th>Parameter</th>
+    <th>Description</th>
+  </tr>
+  <tr>
+    <td><code>gridmix.compression-emulation.enable</code></td>
+    <td>Enables compression emulation in simulated GridMix jobs. 
+        Default is true.</td>
+  </tr>
+</table>
+      
+With compression emulation turned on, GridMix will generate compressed
+input data. Hence the total size of the input
+data will be less than the expected size. Set
+`gridmix.min.file.size` to a smaller value (roughly 10% of
+`gridmix.gen.bytes.per.file`) so that GridMix can
+correctly emulate compression.
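+
+A sketch of a run tuned for compression emulation, with
+`gridmix.min.file.size` lowered to roughly 10% of the default
+`gridmix.gen.bytes.per.file` (the exact value below is illustrative):
+
+    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
+      -Dgridmix.compression-emulation.enable=true \
+      -Dgridmix.min.file.size=104857600 \
+      -generate 100g <iopath> <trace>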
+
+
+Emulating High-Ram jobs
+-----------------------
+
+MapReduce allows users to define a job as a High-Ram job. Tasks from a
+High-Ram job can occupy multiple slots on the task-trackers.
+The task-tracker assigns a fixed amount of virtual memory to each slot.
+Since tasks from High-Ram jobs occupy multiple slots, they can use more
+virtual memory than a default task.
+
+Emulating this behavior is important for the following reasons:
+
+* Impact on the scheduler: Scheduling tasks from High-Ram jobs
+  affects scheduling behavior, as it might result in slot
+  reservation and changed slot/resource utilization.
+
+* Impact on the node: Since High-Ram tasks occupy multiple slots,
+  trackers do some bookkeeping to allocate extra resources for
+  these tasks. This is also a precursor for memory emulation,
+  where tasks with high memory requirements need to be treated
+  as High-Ram tasks.
+
+High-Ram feature emulation can be disabled by setting  
+`gridmix.highram-emulation.enable` to `false`.
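+
+For example:
+
+    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
+      -Dgridmix.highram-emulation.enable=false <iopath> <trace>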
+    
+
+Emulating resource usages
+-------------------------
+
+Usage of resources like CPU, physical memory, virtual memory, JVM heap,
+etc. is recorded by MapReduce using its task counters. This information
+is used by GridMix to emulate the resource usage in the simulated
+tasks. Emulating resource usage helps GridMix exert a load on the test
+cluster similar to that seen on the actual cluster.
+
+MapReduce tasks use resources during their entire lifetime. GridMix
+tries to mimic this behavior by spanning resource usage emulation
+across the entire lifetime of the simulated task. Each resource to be
+emulated should have an  *emulator*  associated with it.
+Each such  *emulator*  should implement the
+`org.apache.hadoop.mapred.gridmix.emulators.resourceusage.ResourceUsageEmulatorPlugin`
+interface. Resource
+ *emulators*  in GridMix are  *plugins*  that can be
+configured (plugged in or out) before every run. GridMix users can
+configure multiple emulator  *plugins*  by passing a comma-separated
+list of  *emulators*  as the value of the
+`gridmix.emulators.resource-usage.plugins` parameter (see the
+configuration sketch after the list below).
+
+List of  *emulators*  shipped with GridMix:
+
+* Cumulative CPU usage  *emulator* :
+  GridMix uses the cumulative CPU usage value published by Rumen
+  and makes sure that the total cumulative CPU usage of the simulated
+  task is close to the value published by Rumen. GridMix can be
+  configured to emulate cumulative CPU usage by adding
+  `org.apache.hadoop.mapred.gridmix.emulators.resourceusage.CumulativeCpuUsageEmulatorPlugin`
+  to the list of emulator  *plugins*  configured for the
+  `gridmix.emulators.resource-usage.plugins` parameter.
+  The CPU usage emulator is designed so that
+  it only emulates at specific progress boundaries of the task. This
+  interval can be configured using
+  `gridmix.emulators.resource-usage.cpu.emulation-interval`.
+  The default value for this parameter is `0.1`, i.e.
+  `10%`.
+
+* Total heap usage  *emulator* :
+  GridMix uses the total heap usage value published by Rumen
+  and makes sure that the total heap usage of the simulated
+  task is close to the value published by Rumen. GridMix can be
+  configured to emulate total heap usage by adding
+  `org.apache.hadoop.mapred.gridmix.emulators.resourceusage.TotalHeapUsageEmulatorPlugin`
+  to the list of emulator  *plugins*  configured for the
+  `gridmix.emulators.resource-usage.plugins` parameter.
+  The heap usage emulator is designed so that
+  it only emulates at specific progress boundaries of the task. This
+  interval can be configured using
+  `gridmix.emulators.resource-usage.heap.emulation-interval`.
+  The default value for this parameter is `0.1`, i.e.
+  a `10%` progress interval.
+
+Note that GridMix will emulate resource usages only for jobs of type *LOADJOB* .
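+
+A configuration sketch that enables both shipped plugins (a shell variable is
+used only to keep the command readable; the CPU emulation interval value is
+illustrative):
+
+    PLUGINS=org.apache.hadoop.mapred.gridmix.emulators.resourceusage.CumulativeCpuUsageEmulatorPlugin
+    PLUGINS=$PLUGINS,org.apache.hadoop.mapred.gridmix.emulators.resourceusage.TotalHeapUsageEmulatorPlugin
+    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
+      -Dgridmix.emulators.resource-usage.plugins=$PLUGINS \
+      -Dgridmix.emulators.resource-usage.cpu.emulation-interval=0.05 \
+      <iopath> <trace>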
+
+
+Simplifying Assumptions
+-----------------------
+
+GridMix will be developed in stages, incorporating feedback and
+patches from the community. Currently its intent is to evaluate
+MapReduce and HDFS performance and not the layers on top of them (i.e.
+the extensive lib and sub-project space). Given these two limitations,
+the following characteristics of job load are not currently captured in
+job traces and cannot be accurately reproduced in GridMix:
+
+* *Filesystem Properties*  - No attempt is made to match block
+  sizes, namespace hierarchies, or any property of input, intermediate
+  or output data other than the bytes/records consumed and emitted from
+  a given task. This implies that some of the most heavily-used parts of
+  the system - text processing, streaming, etc. - cannot be meaningfully tested 
+  with the current implementation.
+  
+* *I/O Rates*  - The rate at which records are
+  consumed/emitted is assumed to be limited only by the speed of the
+  reader/writer and constant throughout the task.
+  
+* *Memory Profile*  - No data on tasks' memory usage over time
+  is available, though the max heap-size is retained.
+  
+* *Skew*  - The records consumed and emitted to/from a given
+  task are assumed to follow observed averages, i.e. records will be
+  more regular than may be seen in the wild. Each map also generates
+  a proportional percentage of data for each reduce, so a job with
+  unbalanced input will be flattened.
+  
+* *Job Failure*  - User code is assumed to be correct.
+  
+* *Job Independence*  - The output or outcome of one job does
+  not affect when or whether a subsequent job will run.
+
+
+Appendix
+--------
+
+Issues tracking the original implementations of
+<a href="https://issues.apache.org/jira/browse/HADOOP-2369">GridMix1</a>,
+<a href="https://issues.apache.org/jira/browse/HADOOP-3770">GridMix2</a>,
+and <a href="https://issues.apache.org/jira/browse/MAPREDUCE-776">GridMix3</a>
+can be found on the Apache Hadoop MapReduce JIRA. Other issues tracking
+the current development of GridMix can be found by searching
+<a href="https://issues.apache.org/jira/browse/MAPREDUCE/component/12313086">
+the Apache Hadoop MapReduce JIRA</a>.

+ 397 - 0
hadoop-tools/hadoop-rumen/src/site/markdown/Rumen.md.vm

@@ -0,0 +1,397 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+#set ( $H3 = '###' )
+#set ( $H4 = '####' )
+#set ( $H5 = '#####' )
+
+Rumen
+=====
+
+---
+
+- [Overview](#Overview)
+    - [Motivation](#Motivation)
+    - [Components](#Components)
+- [How to use Rumen?](#How_to_use_Rumen)
+    - [Trace Builder](#Trace_Builder)
+        - [Example](#Example)
+    - [Folder](#Folder)
+        - [Examples](#Examples)
+- [Appendix](#Appendix)
+    - [Resources](#Resources)
+    - [Dependencies](#Dependencies)
+
+---
+
+Overview
+--------
+    
+*Rumen* is a data extraction and analysis tool built for
+*Apache Hadoop*. *Rumen* mines *JobHistory* logs to 
+extract meaningful data and stores it in an easily-parsed, condensed 
+format or *digest*. The raw trace data from MapReduce logs are 
+often insufficient for simulation, emulation, and benchmarking, as these 
+tools often attempt to measure conditions that did not occur in the 
+source data. For example, if a task ran locally in the raw trace data 
+but a simulation of the scheduler elects to run that task on a remote 
+rack, the simulator requires a runtime its input cannot provide. 
+To fill in these gaps, Rumen performs a statistical analysis of the 
+digest to estimate the variables the trace doesn't supply. Rumen traces 
+drive both Gridmix (a benchmark of Hadoop MapReduce clusters) and Mumak 
+(a simulator for the JobTracker).
+
+
+$H3 Motivation
+     
+* Extracting meaningful data from *JobHistory* logs is a common
+  task for any tool built to work on *MapReduce*. It
+  is tedious to write a custom tool that is so tightly coupled with
+  the *MapReduce* framework. Hence there is a need for a
+  built-in tool that performs the framework-level task of log parsing and
+  analysis. Such a tool would insulate external systems that depend on
+  job history against changes made to the job history format.
+
+* Performing statistical analysis of various attributes of a 
+  *MapReduce Job* such as *task runtimes, task failures 
+  etc* is another common task that the benchmarking 
+  and simulation tools might need. *Rumen* generates 
+  <a href="http://en.wikipedia.org/wiki/Cumulative_distribution_function">
+  *Cumulative Distribution Functions (CDF)*
+  </a> for the Map/Reduce task runtimes. 
+  The runtime CDF can be used to extrapolate the task runtime of
+  incomplete, missing and synthetic tasks. Similarly, a CDF is also
+  computed for the total number of successful tasks for every attempt.
+
+
+$H3 Components
+
+*Rumen* consists of two components:
+
+* *Trace Builder* : 
+  Converts *JobHistory* logs into an easily-parsed format.
+  Currently `TraceBuilder` outputs the trace in 
+  <a href="http://www.json.org/">*JSON*</a> 
+  format.   
+
+* *Folder* :
+  A utility to scale the input trace. A trace obtained from
+  *TraceBuilder* simply summarizes the jobs in the 
+  input folders and files. The time-span within which all the jobs in 
+  a given trace finish can be considered as the trace runtime. 
+  *Folder* can be used to scale the runtime of a trace.
+  Decreasing the trace runtime might involve dropping some jobs from 
+  the input trace and scaling down the runtime of remaining jobs. 
+  Increasing the trace runtime might involve adding some dummy jobs to
+  the resulting trace and scaling up the runtime of individual jobs.
+
+
+How to use Rumen?
+-----------------
+
+Converting *JobHistory* logs into a desired job-trace consists of two steps:
+
+1.  Extracting information into an intermediate format
+
+2.  Adjusting the job-trace obtained from the intermediate trace to 
+    have the desired properties.
+       
+> Extracting information from *JobHistory* logs is a one-time
+> operation. This so-called *Gold Trace* can be reused to
+> generate traces with desired values of properties such as
+> `output-duration`, `concentration` etc.
+       
+*Rumen* provides two basic commands:
+
+* `TraceBuilder`
+* `Folder`
+       
+First, we need to generate the *Gold Trace* by running
+`TraceBuilder` on a job-history folder.
+The output of `TraceBuilder` is a job-trace file (and an
+optional cluster-topology file). If we want to scale the output, we
+can use the `Folder` utility to fold the current trace to the
+desired length. The remaining part of this section explains these
+utilities in detail.
+    
+> Examples in this section assume that certain libraries are present
+> in the Java CLASSPATH. See the [Dependencies](#Dependencies) section for more details.
+
+
+$H3 Trace Builder
+      
+`Command:`
+
+    java org.apache.hadoop.tools.rumen.TraceBuilder [options] <jobtrace-output> <topology-output> <inputs>
+  
+This command invokes the `TraceBuilder` utility of
+*Rumen*. It converts the JobHistory files into a series of JSON
+objects and writes them into the `<jobtrace-output>`
+file. It also extracts the cluster layout (topology) and writes it to
+the `<topology-output>` file.
+`<inputs>` represents a space-separated list of
+JobHistory files and folders.
+         
+> 1) Input and output to `TraceBuilder` are expected to
+> be fully qualified FileSystem paths. So use file://
+> to specify files on the `local` FileSystem and
+> hdfs:// to specify files on HDFS. Since the input files or
+> folders are FileSystem paths, they can be globbed.
+> This can be useful for specifying multiple file paths using
+> glob patterns.
+
+> 2) By default, TraceBuilder does not recursively scan the input
+> folder for job history files. Only the files that are directly
+> placed under the input folder will be considered for generating
+> the trace. To add all the files under the input directory by
+> recursively scanning it, use the `-recursive`
+> option.
+
+Cluster topology is used as follows:
+
+* To reconstruct the splits and make sure that the 
+  distances/latencies seen in the actual run are modeled correctly.
+
+* To extrapolate splits information for tasks with missing splits
+  details or synthetically generated tasks.
+      
+`Options:`
+
+<table>
+  <tr>
+    <th> Parameter</th>
+    <th> Description</th>
+    <th> Notes </th>
+  </tr>
+  <tr>
+    <td><code>-demuxer</code></td>
+    <td>Used to read the jobhistory files. The default is 
+        <code>DefaultInputDemuxer</code>.</td>
+    <td>Demuxer decides how the input file maps to jobhistory file(s). 
+        Job history logs and job configuration files are typically small 
+        files, and can be more effectively stored when embedded in some
+        container file format like SequenceFile or TFile. To support such 
+        usage cases, one can specify a customized Demuxer class that can 
+        extract individual job history logs and job configuration files 
+        from the source files.
+    </td>
+  </tr>
+  <tr>
+     <td><code>-recursive</code></td>
+     <td>Recursively traverse input paths for job history logs.</td>
+     <td>This option should be used to inform the TraceBuilder to
+     recursively scan the input paths and process all the files under it.
+     Note that, by default, only the history logs that are directly under
+     the input folder are considered for generating the trace.
+     </td>
+  </tr>
+</table>
+      
+
+$H4 Example
+
+    java org.apache.hadoop.tools.rumen.TraceBuilder file:///home/user/job-trace.json file:///home/user/topology.output file:///home/user/logs/history/done
+
+This will analyze all the jobs in
+`/home/user/logs/history/done` stored on the
+`local` FileSystem and output the job traces in
+`/home/user/job-trace.json` along with topology
+information in `/home/user/topology.output`.
+
+
+$H3 Folder
+      
+`Command`:
+
+    java org.apache.hadoop.tools.rumen.Folder [options] [input] [output]
+      
+> Input and output to `Folder` is expected to be a fully 
+> qualified FileSystem path. So use file:// to specify 
+> files on the `local` FileSystem and hdfs:// to
+> specify files on HDFS.
+      
+This command invokes the `Folder` utility of 
+*Rumen*. Folding essentially means that the output duration of 
+the resulting trace is fixed and job timelines are adjusted 
+to respect the final output duration. 
+
+`Options:`
+
+<table>
+  <tr>
+    <th> Parameter</th>
+    <th> Description</th>
+    <th> Notes </th>
+  </tr>
+  <tr>
+    <td><code>-input-cycle</code></td>
+    <td>Defines the basic unit of time for the folding operation. There is
+        no default value for <code>input-cycle</code>. 
+        <strong>Input cycle must be provided</strong>.
+    </td>
+    <td>'<code>-input-cycle 10m</code>'
+        implies that the whole trace run will now be sliced at a 10-minute
+        interval. Basic operations will be done on these 10-minute chunks.
+        Note that <em>Rumen</em> understands various time units like
+        <em>m (minutes), h (hours), d (days) etc</em>.
+    </td>
+  </tr>
+  <tr>
+    <td><code>-output-duration</code></td>
+    <td>This parameter defines the final runtime of the trace.
+        The default value is <strong>1 hour</strong>.
+    </td>
+    <td>'<code>-output-duration 30m</code>' 
+        implies that the resulting trace will have a max runtime of 
+        30mins. All the jobs in the input trace file will be folded and 
+        scaled to fit this window.
+    </td>
+  </tr>
+  <tr>
+    <td><code>-concentration</code></td>
+    <td>Set the concentration of the resulting trace. Default value is 
+        <strong>1</strong>.
+    </td>
+    <td>If the total runtime of the resulting trace is less than the total
+        runtime of the input trace, then the resulting trace would contain
+        fewer jobs than the input trace. This
+        essentially means that the output is diluted. To increase the
+        density of jobs, set the concentration to a higher value.</td>
+  </tr>
+  <tr>
+    <td><code>-debug</code></td>
+    <td>Run the Folder in debug mode. By default it is set to 
+        <strong>false</strong>.</td>
+    <td>In debug mode, the Folder will print additional statements for 
+        debugging. Also the intermediate files generated in the scratch 
+        directory will not be cleaned up.
+    </td>
+  </tr>
+  <tr>
+    <td><code>-seed</code></td>
+    <td>Initial seed to the Random Number Generator. By default, a Random 
+        Number Generator is used to generate a seed and the seed value is
+        reported back to the user for future use.
+    </td>
+    <td>If an initial seed is passed, then the <code>Random Number
+        Generator</code> will generate the random numbers in the same
+        sequence, i.e. the sequence of random numbers remains the same if
+        the same seed is used. Folder uses the random number generator to
+        decide whether or not to emit a job.
+    </td>
+  </tr>
+  <tr>
+    <td><code>-temp-directory</code></td>
+    <td>Temporary directory for the Folder. By default the <strong>output
+        folder's parent directory</strong> is used as the scratch space.
+    </td>
+    <td>This is the scratch space used by Folder.  All the 
+        temporary files are cleaned up in the end unless the Folder is run
+        in <code>debug</code> mode.</td>
+  </tr>
+  <tr>
+    <td><code>-skew-buffer-length</code></td>
+    <td>Enables <em>Folder</em> to tolerate skewed jobs.
+        The default buffer length is <strong>0</strong>.</td>
+    <td>'<code>-skew-buffer-length 100</code>' 
+        indicates that if the jobs appear out of order within a window 
+        size of 100, then they will be emitted in-order by the folder. 
+        If a job appears out-of-order outside this window, then the Folder
+        will bail out provided <code>-allow-missorting</code> is not set.
+        <em>Folder</em> reports the maximum skew size seen in the 
+        input trace for future use.
+    </td>
+  </tr>
+  <tr>
+    <td><code>-allow-missorting</code></td>
+    <td>Enables <em>Folder</em> to tolerate out-of-order jobs. By default 
+        mis-sorting is not allowed.
+    </td>
+    <td>If mis-sorting is allowed, then the <em>Folder</em> will ignore 
+        out-of-order jobs that cannot be deskewed using a skew buffer of
+        size specified using <code>-skew-buffer-length</code>. If 
+        mis-sorting is not allowed, then the Folder will bail out if the
+        skew buffer is incapable of tolerating the skew.
+    </td>
+  </tr>
+</table>
+
+
+$H4 Examples
+
+$H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime
+               
+    java org.apache.hadoop.tools.rumen.Folder -output-duration 1h  -input-cycle 20m  file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json
+
+If the folded jobs are out of order then the command will bail out. 
+
+$H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime and tolerate some skewness
+
+    java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m  -allow-missorting -skew-buffer-length 100 file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json
+
+If the folded jobs are out of order, then at most
+100 jobs will be de-skewed. If the 101<sup>st</sup> job is 
+*out-of-order*, then the command will bail out.
+
+$H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime in debug mode
+
+    java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m  -debug -temp-directory file:///tmp/debug file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json
+
+This will fold the 10hr job-trace file 
+`file:///home/user/job-trace.json` to finish within 1hr 
+and use `file:///tmp/debug` as the temporary directory. 
+The intermediate files in the temporary directory will not be cleaned
+up.
+      
+$H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime with custom concentration
+
+    java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m  -concentration 2  file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json
+
+This will fold the 10hr job-trace file
+`file:///home/user/job-trace.json` to finish within 1hr
+with a concentration of 2. Folding a 10-hour trace into 1 hour (as in the
+first example above) retains 10% of the jobs; with *concentration* set
+to 2, 20% of the total input jobs will be retained.
+
+
+Appendix
+--------
+    
+$H3 Resources
+
+<a href="https://issues.apache.org/jira/browse/MAPREDUCE-751">MAPREDUCE-751</a>
+is the main JIRA that introduced *Rumen* to *MapReduce*. 
+Look at the MapReduce
+<a href="https://issues.apache.org/jira/browse/MAPREDUCE/component/12313617">
+rumen-component</a> for further details.
+
+
+$H3 Dependencies
+
+*Rumen* expects certain library *JARs* to be present in 
+the *CLASSPATH*. The required libraries are:
+
+* `Hadoop MapReduce Tools` (`hadoop-mapred-tools-{hadoop-version}.jar`)
+* `Hadoop Common` (`hadoop-common-{hadoop-version}.jar`)
+* `Apache Commons Logging` (`commons-logging-1.1.1.jar`)
+* `Apache Commons CLI` (`commons-cli-1.2.jar`)
+* `Jackson Mapper` (`jackson-mapper-asl-1.4.2.jar`)
+* `Jackson Core` (`jackson-core-asl-1.4.2.jar`)
+
+> One simple way to run Rumen is to invoke it via
+> '$HADOOP_HOME/bin/hadoop jar', which sets up the Hadoop CLASSPATH for you.
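+
+A sketch of such an invocation (the Rumen JAR placeholder and the paths are
+illustrative; the exact JAR name depends on the release):
+
+    # the hadoop wrapper sets up the Hadoop portion of the CLASSPATH
+    hadoop jar <rumen-jar> org.apache.hadoop.tools.rumen.TraceBuilder \
+      file:///home/user/job-trace.json file:///home/user/topology.output \
+      file:///home/user/logs/history/done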