
MAPREDUCE-6150. Update document of Rumen (Masatake Iwasaki via aw)

Allen Wittenauer 10 years ago
parent commit 3f982c5c26

+ 2 - 0
hadoop-mapreduce-project/CHANGES.txt

@@ -264,6 +264,8 @@ Release 2.7.0 - UNRELEASED
 
 
    MAPREDUCE-6141. History server leveldb recovery store (jlowe)

+    MAPREDUCE-6150. Update document of Rumen (Masatake Iwasaki via aw)
+
  OPTIMIZATIONS

    MAPREDUCE-6169. MergeQueue should release reference to the current item 

+ 1 - 0
hadoop-project/src/site/site.xml

@@ -105,6 +105,7 @@
       <item name="Hadoop Streaming" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopStreaming.html"/>
       <item name="Hadoop Streaming" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopStreaming.html"/>
       <item name="Hadoop Archives" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopArchives.html"/>
       <item name="Hadoop Archives" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopArchives.html"/>
       <item name="DistCp" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistCp.html"/>
       <item name="DistCp" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistCp.html"/>
+      <item name="Rumen" href="hadoop-rumen/Rumen.html"/>
     </menu>
     </menu>
 
 
     <menu name="MapReduce REST APIs" inherit="top">
     <menu name="MapReduce REST APIs" inherit="top">

+ 88 - 47
hadoop-tools/hadoop-rumen/src/site/markdown/Rumen.md.vm

@@ -29,9 +29,7 @@ Rumen
     - [Components](#Components)
 - [How to use Rumen?](#How_to_use_Rumen)
     - [Trace Builder](#Trace_Builder)
-        - [Example](#Example)
     - [Folder](#Folder)
-        - [Examples](#Examples)
 - [Appendix](#Appendix)
     - [Resources](#Resources)
     - [Dependencies](#Dependencies)
@@ -128,18 +126,21 @@ can use the `Folder` utility to fold the current trace to the
 desired length. The remaining part of this section explains these 
 utilities in detail.
     
-> Examples in this section assumes that certain libraries are present 
-> in the java CLASSPATH. See <em>Section-3.2</em> for more details.
+Examples in this section assume that certain libraries are present
+in the Java CLASSPATH. See [Dependencies](#Dependencies) for more details.
 
 
 
 
 $H3 Trace Builder
      
-`Command:`
+$H4 Command
 
 
-    java org.apache.hadoop.tools.rumen.TraceBuilder [options] <jobtrace-output> <topology-output> <inputs>
+```
+java org.apache.hadoop.tools.rumen.TraceBuilder [options] <jobtrace-output> <topology-output> <inputs>
+```
   
   
-This command invokes the `TraceBuilder` utility of
-*Rumen*. It converts the JobHistory files into a series of JSON
+This command invokes the `TraceBuilder` utility of *Rumen*.
+
+TraceBuilder converts the JobHistory files into a series of JSON
 objects and writes them into the `<jobtrace-output>`
 file. It also extracts the cluster layout (topology) and writes it in
 the`<topology-output>` file.
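A quick way to sanity-check the two outputs described above (an illustrative aside rather than part of the patch; the `/tmp` paths match the examples that follow) is to inspect them with standard tools, since both are plain JSON files:

```
# Peek at the beginning of the generated job trace (illustrative local path).
head -c 1024 /tmp/job-trace.json

# The extracted cluster topology is written as JSON as well.
head -c 1024 /tmp/job-topology.json
```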
@@ -169,7 +170,7 @@ Cluster topology is used as follows :
 * To extrapolate splits information for tasks with missing splits
   details or synthetically generated tasks.
      
-`Options :`
+$H4 Options
 
 
 <table>
   <tr>
@@ -204,33 +205,45 @@ Cluster topology is used as follows :
 
 
 $H4 Example
 
 
-    java org.apache.hadoop.tools.rumen.TraceBuilder file:///home/user/job-trace.json file:///home/user/topology.output file:///home/user/logs/history/done
+*Rumen* expects certain library *JARs* to be present in the *CLASSPATH*.
+One simple way to run Rumen is to use the `$HADOOP_HOME/bin/hadoop jar`
+command, as shown under [Dependencies](#Dependencies).
 
 
-This will analyze all the jobs in 
+```
+java org.apache.hadoop.tools.rumen.TraceBuilder \
+  file:///tmp/job-trace.json \
+  file:///tmp/job-topology.json \
+  hdfs:///tmp/hadoop-yarn/staging/history/done_intermediate/testuser
+```
 
 
-`/home/user/logs/history/done` stored on the 
-`local` FileSystem and output the jobtraces in 
-`/home/user/job-trace.json` along with topology 
-information in `/home/user/topology.output`.
+This will analyze all the jobs in
+`/tmp/hadoop-yarn/staging/history/done_intermediate/testuser`
+stored on the `HDFS` FileSystem
+and output the jobtraces in `/tmp/job-trace.json`
+along with topology information in `/tmp/job-topology.json`
+stored on the `local` FileSystem.
 
 
 
 
 $H3 Folder
      
-`Command`:
+$H4 Command
 
 
-    java org.apache.hadoop.tools.rumen.Folder [options] [input] [output]
-      
-> Input and output to `Folder` is expected to be a fully 
-> qualified FileSystem path. So use file:// to specify 
-> files on the `local` FileSystem and hdfs:// to
-> specify files on HDFS.
+```
+java org.apache.hadoop.tools.rumen.Folder [options] [input] [output]
+```
       
       
 This command invokes the `Folder` utility of 
 *Rumen*. Folding essentially means that the output duration of 
 the resulting trace is fixed and job timelines are adjusted 
 to respect the final output duration. 
 
 
-`Options :`
+> Input and output to `Folder` is expected to be a fully 
+> qualified FileSystem path. So use `file://` to specify 
+> files on the `local` FileSystem and `hdfs://` to
+> specify files on HDFS.
+
+
+$H4 Options
 
 
 <table>
   <tr>
@@ -335,14 +348,28 @@ to respect the final output duration.
 $H4 Examples

 $H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime
-               
-    java org.apache.hadoop.tools.rumen.Folder -output-duration 1h  -input-cycle 20m  file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json
+
+```
+java org.apache.hadoop.tools.rumen.Folder \
+  -output-duration 1h \
+  -input-cycle 20m \
+  file:///tmp/job-trace.json \
+  file:///tmp/job-trace-1hr.json
+```
 
 
 If the folded jobs are out of order then the command will bail out. 

 $H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime and tolerate some skewness

-    java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m  -allow-missorting -skew-buffer-length 100 file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json
+```
+java org.apache.hadoop.tools.rumen.Folder \
+  -output-duration 1h \
+  -input-cycle 20m \
+  -allow-missorting \
+  -skew-buffer-length 100 \
+  file:///tmp/job-trace.json \
+  file:///tmp/job-trace-1hr.json
+```
 
 
 If the folded jobs are out of order, then atmost
 100 jobs will be de-skewed. If the 101<sup>st</sup> job is 
@@ -350,23 +377,37 @@ If the folded jobs are out of order, then atmost
 
 
 $H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime in debug mode

-    java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m  -debug -temp-directory file:///tmp/debug file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json
+```
+java org.apache.hadoop.tools.rumen.Folder \
+  -output-duration 1h \
+  -input-cycle 20m \
+  -debug -temp-directory file:///tmp/debug \
+  file:///tmp/job-trace.json \
+  file:///tmp/job-trace-1hr.json
+```
 
 
 This will fold the 10hr job-trace file 
-`file:///home/user/job-trace.json` to finish within 1hr 
+`file:///tmp/job-trace.json` to finish within 1hr 
 and use `file:///tmp/debug` as the temporary directory. 
 The intermediate files in the temporary directory will not be cleaned
 up.
      
 $H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime with custom concentration.

-    java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m  -concentration 2  file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json</source>
+```
+java org.apache.hadoop.tools.rumen.Folder \
+  -output-duration 1h \
+  -input-cycle 20m \
+  -concentration 2 \
+  file:///tmp/job-trace.json \
+  file:///tmp/job-trace-1hr.json
+```
 
 
 This will fold the 10hr job-trace file 
-`file:///home/user/job-trace.json` to finish within 1hr 
-with concentration of 2. `Example-2.3.2` will retain 10% 
-of the jobs. With *concentration* as 2, 20% of the total input 
-jobs will be retained.
+`file:///tmp/job-trace.json` to finish within 1hr
+with a concentration of 2.
+By default, folding a 10hr job-trace down to 1hr retains 10% of the jobs.
+With *concentration* as 2, 20% of the total input jobs will be retained.
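To make the ratio concrete (the job count here is hypothetical and used only as an illustration): if the 10hr input trace contained 2,000 jobs, folding it to 1hr would keep roughly 200 jobs by default, and roughly 400 with `-concentration 2`.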
 
 
 
 
 Appendix
@@ -377,21 +418,21 @@ $H3 Resources
 <a href="https://issues.apache.org/jira/browse/MAPREDUCE-751">MAPREDUCE-751</a>
 <a href="https://issues.apache.org/jira/browse/MAPREDUCE-751">MAPREDUCE-751</a>
 is the main JIRA that introduced *Rumen* to *MapReduce*. 
 is the main JIRA that introduced *Rumen* to *MapReduce*. 
 Look at the MapReduce
 Look at the MapReduce
-<a href="https://issues.apache.org/jira/browse/MAPREDUCE/component/12313617">
-rumen-component</a>for further details.
+<a href="https://issues.apache.org/jira/browse/MAPREDUCE/component/12313617">rumen-component</a>
+for further details.
 
 
 
 
 $H3 Dependencies
 
 
-*Rumen* expects certain library *JARs* to be present in 
-the *CLASSPATH*. The required libraries are
-
-* `Hadoop MapReduce Tools` (`hadoop-mapred-tools-{hadoop-version}.jar`)
-* `Hadoop Common` (`hadoop-common-{hadoop-version}.jar`)
-* `Apache Commons Logging` (`commons-logging-1.1.1.jar`)
-* `Apache Commons CLI` (`commons-cli-1.2.jar`)
-* `Jackson Mapper` (`jackson-mapper-asl-1.4.2.jar`)
-* `Jackson Core` (`jackson-core-asl-1.4.2.jar`)
-
-> One simple way to run Rumen is to use '$HADOOP_HOME/bin/hadoop jar' 
-> option  to run it.
+*Rumen* expects certain library *JARs* to be present in the *CLASSPATH*.
+One simple way to run Rumen is to use the `hadoop jar` command,
+as in the example below.
+
+```
+$HADOOP_HOME/bin/hadoop jar \
+  $HADOOP_HOME/share/hadoop/tools/lib/hadoop-rumen-2.5.1.jar \
+  org.apache.hadoop.tools.rumen.TraceBuilder \
+  file:///tmp/job-trace.json \
+  file:///tmp/job-topology.json \
+  hdfs:///tmp/hadoop-yarn/staging/history/done_intermediate/testuser
+```
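
As an illustrative aside (not part of the patch): the rumen JAR version in the path above tracks the installed Hadoop release, so the same invocation can let the shell resolve the JAR with a glob instead of hard-coding `2.5.1`, assuming a single matching JAR is present in that directory:

```
$HADOOP_HOME/bin/hadoop jar \
  $HADOOP_HOME/share/hadoop/tools/lib/hadoop-rumen-*.jar \
  org.apache.hadoop.tools.rumen.TraceBuilder \
  file:///tmp/job-trace.json \
  file:///tmp/job-topology.json \
  hdfs:///tmp/hadoop-yarn/staging/history/done_intermediate/testuser
```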