@@ -29,9 +29,7 @@ Rumen
- [Components](#Components)
- [How to use Rumen?](#How_to_use_Rumen)
- [Trace Builder](#Trace_Builder)
- - [Example](#Example)
- [Folder](#Folder)
- - [Examples](#Examples)
- [Appendix](#Appendix)
- [Resources](#Resources)
- [Dependencies](#Dependencies)
@@ -128,18 +126,21 @@ can use the `Folder` utility to fold the current trace to the
desired length. The remaining part of this section explains these
utilities in detail.

-> Examples in this section assumes that certain libraries are present
-> in the java CLASSPATH. See <em>Section-3.2</em> for more details.
+Examples in this section assume that certain libraries are present
+in the java CLASSPATH. See [Dependencies](#Dependencies) for more details.


$H3 Trace Builder

-`Command:`
+$H4 Command

- java org.apache.hadoop.tools.rumen.TraceBuilder [options] <jobtrace-output> <topology-output> <inputs>
+```
+java org.apache.hadoop.tools.rumen.TraceBuilder [options] <jobtrace-output> <topology-output> <inputs>
+```

-This command invokes the `TraceBuilder` utility of
-*Rumen*. It converts the JobHistory files into a series of JSON
+This command invokes the `TraceBuilder` utility of *Rumen*.
+
+TraceBuilder converts the JobHistory files into a series of JSON
objects and writes them into the `<jobtrace-output>`
file. It also extracts the cluster layout (topology) and writes it in
the`<topology-output>` file.
@@ -169,7 +170,7 @@ Cluster topology is used as follows :
* To extrapolate splits information for tasks with missing splits
details or synthetically generated tasks.

-`Options :`
+$H4 Options

<table>
<tr>
@@ -204,33 +205,45 @@ Cluster topology is used as follows :

$H4 Example

- java org.apache.hadoop.tools.rumen.TraceBuilder file:///home/user/job-trace.json file:///home/user/topology.output file:///home/user/logs/history/done
+*Rumen* expects certain library *JARs* to be present in the *CLASSPATH*.
+One simple way to satisfy this is to run Rumen through the
+`$HADOOP_HOME/bin/hadoop jar` command, as shown in [Dependencies](#Dependencies).

-This will analyze all the jobs in
+```
+java org.apache.hadoop.tools.rumen.TraceBuilder \
+  file:///tmp/job-trace.json \
+  file:///tmp/job-topology.json \
+  hdfs:///tmp/hadoop-yarn/staging/history/done_intermediate/testuser
+```

-`/home/user/logs/history/done` stored on the
-`local` FileSystem and output the jobtraces in
-`/home/user/job-trace.json` along with topology
-information in `/home/user/topology.output`.
+This will analyze all the jobs in
+`/tmp/hadoop-yarn/staging/history/done_intermediate/testuser`
+stored on the `HDFS` FileSystem
+and output the job trace in `/tmp/job-trace.json`
+along with the topology information in `/tmp/job-topology.json`,
+both stored on the `local` FileSystem.
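For a quick look at what TraceBuilder produced, the outputs can be inspected directly; this is only a sketch and assumes the local output paths used in the example above:

```
head /tmp/job-trace.json
cat /tmp/job-topology.json
```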
$H3 Folder

-`Command`:
+$H4 Command

- java org.apache.hadoop.tools.rumen.Folder [options] [input] [output]
-
-> Input and output to `Folder` is expected to be a fully
-> qualified FileSystem path. So use file:// to specify
-> files on the `local` FileSystem and hdfs:// to
-> specify files on HDFS.
+```
+java org.apache.hadoop.tools.rumen.Folder [options] [input] [output]
+```

This command invokes the `Folder` utility of
*Rumen*. Folding essentially means that the output duration of
the resulting trace is fixed and job timelines are adjusted
to respect the final output duration.

-`Options :`
+> Input and output to `Folder` are expected to be fully
+> qualified FileSystem paths. So use `file://` to specify
+> files on the `local` FileSystem and `hdfs://` to
+> specify files on HDFS.
+
+
+$H4 Options

<table>
<tr>
@@ -335,14 +348,28 @@ to respect the final output duration.
$H4 Examples

$H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime
-
- java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json
+
+```
+java org.apache.hadoop.tools.rumen.Folder \
+  -output-duration 1h \
+  -input-cycle 20m \
+  file:///tmp/job-trace.json \
+  file:///tmp/job-trace-1hr.json
+```

If the folded jobs are out of order then the command will bail out.

$H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime and tolerate some skewness

- java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m -allow-missorting -skew-buffer-length 100 file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json
+```
+java org.apache.hadoop.tools.rumen.Folder \
+  -output-duration 1h \
+  -input-cycle 20m \
+  -allow-missorting \
+  -skew-buffer-length 100 \
+  file:///tmp/job-trace.json \
+  file:///tmp/job-trace-1hr.json
+```

If the folded jobs are out of order, then atmost
100 jobs will be de-skewed. If the 101<sup>st</sup> job is
@@ -350,23 +377,37 @@ If the folded jobs are out of order, then atmost
$H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime in debug mode

- java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m -debug -temp-directory file:///tmp/debug file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json
+```
+java org.apache.hadoop.tools.rumen.Folder \
+  -output-duration 1h \
+  -input-cycle 20m \
+  -debug -temp-directory file:///tmp/debug \
+  file:///tmp/job-trace.json \
+  file:///tmp/job-trace-1hr.json
+```

This will fold the 10hr job-trace file
-`file:///home/user/job-trace.json` to finish within 1hr
+`file:///tmp/job-trace.json` to finish within 1hr
and use `file:///tmp/debug` as the temporary directory.
The intermediate files in the temporary directory will not be cleaned
up.
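Since the intermediate files are not cleaned up, the temporary directory can be inspected after the run finishes; this sketch assumes the `-temp-directory` value used in the example above:

```
ls /tmp/debug
```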
$H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime with custom concentration.

- java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m -concentration 2 file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json</source>
+```
+java org.apache.hadoop.tools.rumen.Folder \
+  -output-duration 1h \
+  -input-cycle 20m \
+  -concentration 2 \
+  file:///tmp/job-trace.json \
+  file:///tmp/job-trace-1hr.json
+```

This will fold the 10hr job-trace file
-`file:///home/user/job-trace.json` to finish within 1hr
-with concentration of 2. `Example-2.3.2` will retain 10%
-of the jobs. With *concentration* as 2, 20% of the total input
-jobs will be retained.
+`file:///tmp/job-trace.json` to finish within 1hr
+with a concentration of 2.
+If the 10h job-trace is folded to 1h, it retains 10% of the jobs by default.
+With *concentration* as 2, 20% of the total input jobs will be retained.
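To see the effect of *concentration* on a concrete trace, the number of jobs kept by the folding can be compared before and after. This is only a rough sketch: it assumes the trace is pretty-printed with one field per line and that each job record carries a single `jobID` field:

```
grep -c '"jobID"' /tmp/job-trace.json
grep -c '"jobID"' /tmp/job-trace-1hr.json
```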
Appendix
@@ -377,21 +418,21 @@ $H3 Resources
<a href="https://issues.apache.org/jira/browse/MAPREDUCE-751">MAPREDUCE-751</a>
is the main JIRA that introduced *Rumen* to *MapReduce*.
Look at the MapReduce
-<a href="https://issues.apache.org/jira/browse/MAPREDUCE/component/12313617">
-rumen-component</a>for further details.
+<a href="https://issues.apache.org/jira/browse/MAPREDUCE/component/12313617">rumen-component</a>
+for further details.


$H3 Dependencies

-*Rumen* expects certain library *JARs* to be present in
-the *CLASSPATH*. The required libraries are
-
-* `Hadoop MapReduce Tools` (`hadoop-mapred-tools-{hadoop-version}.jar`)
-* `Hadoop Common` (`hadoop-common-{hadoop-version}.jar`)
-* `Apache Commons Logging` (`commons-logging-1.1.1.jar`)
-* `Apache Commons CLI` (`commons-cli-1.2.jar`)
-* `Jackson Mapper` (`jackson-mapper-asl-1.4.2.jar`)
-* `Jackson Core` (`jackson-core-asl-1.4.2.jar`)
-
-> One simple way to run Rumen is to use '$HADOOP_HOME/bin/hadoop jar'
-> option to run it.
+*Rumen* expects certain library *JARs* to be present in the *CLASSPATH*.
+One simple way to run Rumen is to use the
+`hadoop jar` command, as in the example below.
+
+```
+$HADOOP_HOME/bin/hadoop jar \
+  $HADOOP_HOME/share/hadoop/tools/lib/hadoop-rumen-2.5.1.jar \
+  org.apache.hadoop.tools.rumen.TraceBuilder \
+  file:///tmp/job-trace.json \
+  file:///tmp/job-topology.json \
+  hdfs:///tmp/hadoop-yarn/staging/history/done_intermediate/testuser
+```
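The same `hadoop jar` wrapper can also drive the `Folder` utility. The sketch below reuses the jar path, version and trace paths from the example above, so treat them as placeholders for your installation:

```
$HADOOP_HOME/bin/hadoop jar \
  $HADOOP_HOME/share/hadoop/tools/lib/hadoop-rumen-2.5.1.jar \
  org.apache.hadoop.tools.rumen.Folder \
  -output-duration 1h \
  -input-cycle 20m \
  file:///tmp/job-trace.json \
  file:///tmp/job-trace-1hr.json
```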