17 years ago · cc34134077
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -777,6 +777,8 @@ Release 0.18.0 - Unreleased
 
				     HADOOP-3692. Fix documentation for Cluster setup and Quick start guides. 
			
 
				     (Amareshwari Sriramadasu via ddas)
			
 
				 
			
 
				+    HADOOP-3691. Fix streaming and tutorial docs. (Jothi Padmanabhan via ddas)
			
 
				+
			
 
				 Release 0.17.1 - Unreleased
			
 
				 
			
 
				   INCOMPATIBLE CHANGES
			
--- a/docs/changes.html
+++ b/docs/changes.html
@@ -378,7 +378,7 @@ InputFormat.validateInput.<br />(tomwhite via omalley)</li>
 
				     </ol>
			
 
				   </li>
			
 
				   <li><a href="javascript:toggleList('release_0.18.0_-_unreleased_._bug_fixes_')">  BUG FIXES
			
 
				-</a>&nbsp;&nbsp;&nbsp;(113)
			
 
				+</a>&nbsp;&nbsp;&nbsp;(115)
			
 
				     <ol id="release_0.18.0_-_unreleased_._bug_fixes_">
			
 
				       <li><a href="http://issues.apache.org/jira/browse/HADOOP-2905">HADOOP-2905</a>. 'fsck -move' triggers NPE in NameNode.<br />(Lohit Vjayarenu via rangadi)</li>
			
 
				       <li>Increment ClientProtocol.versionID missed by <a href="http://issues.apache.org/jira/browse/HADOOP-2585">HADOOP-2585</a>.<br />(shv)</li>
			
@@ -603,6 +603,8 @@ input. Validation job still runs on default fs.<br />(Jothi Padmanabhan via cdou
 
				 conform to style guidelines.<br />(Amareshwari Sriramadasu via cdouglas)</li>
			
 
				       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3653">HADOOP-3653</a>. Fix test-patch target to properly account for Eclipse
			
 
				 classpath jars.<br />(Brice Arnould via nigel)</li>
			
 
				+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3692">HADOOP-3692</a>. Fix documentation for Cluster setup and Quick start guides.<br />(Amareshwari Sriramadasu via ddas)</li>
			
 
				+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3691">HADOOP-3691</a>. Fix streaming and tutorial docs.<br />(Jothi Padmanabhan via ddas)</li>
			
 
				     </ol>
			
 
				   </li>
			
 
				 </ul>
			
--- a/docs/mapred_tutorial.html
+++ b/docs/mapred_tutorial.html
@@ -5,7 +5,7 @@
 
				 <meta content="Apache Forrest" name="Generator">
			
 
				 <meta name="Forrest-version" content="0.8">
			
 
				 <meta name="Forrest-skin-name" content="pelt">
			
 
				-<title>Hadoop Map-Reduce Tutorial</title>
			
 
				+<title>Hadoop Map/Reduce Tutorial</title>
			
 
				 <link type="text/css" href="skin/basic.css" rel="stylesheet">
			
 
				 <link media="screen" type="text/css" href="skin/screen.css" rel="stylesheet">
			
 
				 <link media="print" type="text/css" href="skin/print.css" rel="stylesheet">
			
@@ -187,7 +187,7 @@ document.write("Last Published: " + document.lastModified);
 
				 <a class="dida" href="mapred_tutorial.pdf"><img alt="PDF -icon" src="skin/images/pdfdoc.gif" class="skin"><br>
			
 
				         PDF</a>
			
 
				 </div>
			
 
				-<h1>Hadoop Map-Reduce Tutorial</h1>
			
 
				+<h1>Hadoop Map/Reduce Tutorial</h1>
			
 
				 <div id="minitoc-area">
			
 
				 <ul class="minitoc">
			
 
				 <li>
			
@@ -217,7 +217,7 @@ document.write("Last Published: " + document.lastModified);
 
				 </ul>
			
 
				 </li>
			
 
				 <li>
			
 
				-<a href="#Map-Reduce+-+User+Interfaces">Map-Reduce - User Interfaces</a>
			
 
				+<a href="#Map%2FReduce+-+User+Interfaces">Map/Reduce - User Interfaces</a>
			
 
				 <ul class="minitoc">
			
 
				 <li>
			
 
				 <a href="#Payload">Payload</a>
			
@@ -328,7 +328,7 @@ document.write("Last Published: " + document.lastModified);
 
				 <h2 class="h3">Purpose</h2>
			
 
				 <div class="section">
			
 
				 <p>This document comprehensively describes all user-facing facets of the 
			
 
				-      Hadoop Map-Reduce framework and serves as a tutorial.
			
 
				+      Hadoop Map/Reduce framework and serves as a tutorial.
			
 
				       </p>
			
 
				 </div>
			
 
				     
			
@@ -356,11 +356,11 @@ document.write("Last Published: " + document.lastModified);
 
				 <a name="N10032"></a><a name="Overview"></a>
			
 
				 <h2 class="h3">Overview</h2>
			
 
				 <div class="section">
			
 
				-<p>Hadoop Map-Reduce is a software framework for easily writing 
			
 
				+<p>Hadoop Map/Reduce is a software framework for easily writing 
			
 
				       applications which process vast amounts of data (multi-terabyte data-sets) 
			
 
				       in-parallel on large clusters (thousands of nodes) of commodity 
			
 
				       hardware in a reliable, fault-tolerant manner.</p>
			
 
				-<p>A Map-Reduce <em>job</em> usually splits the input data-set into 
			
 
				+<p>A Map/Reduce <em>job</em> usually splits the input data-set into 
			
 
				       independent chunks which are processed by the <em>map tasks</em> in a
			
 
				       completely parallel manner. The framework sorts the outputs of the maps, 
			
 
				       which are then input to the <em>reduce tasks</em>. Typically both the 
			
@@ -368,12 +368,12 @@ document.write("Last Published: " + document.lastModified);
 
				       takes care of scheduling tasks, monitoring them and re-executes the failed
			
 
				       tasks.</p>
			
 
				 <p>Typically the compute nodes and the storage nodes are the same, that is, 
			
 
				-      the Map-Reduce framework and the <a href="hdfs_design.html">Distributed 
			
 
				+      the Map/Reduce framework and the <a href="hdfs_design.html">Distributed 
			
 
				       FileSystem</a> are running on the same set of nodes. This configuration
			
 
				       allows the framework to effectively schedule tasks on the nodes where data 
			
 
				       is already present, resulting in very high aggregate bandwidth across the 
			
 
				       cluster.</p>
			
 
				-<p>The Map-Reduce framework consists of a single master 
			
 
				+<p>The Map/Reduce framework consists of a single master 
			
 
				       <span class="codefrag">JobTracker</span> and one slave <span class="codefrag">TaskTracker</span> per 
			
 
				       cluster-node. The master is responsible for scheduling the jobs' component 
			
 
				       tasks on the slaves, monitoring them and re-executing the failed tasks. The 
			
@@ -388,7 +388,7 @@ document.write("Last Published: " + document.lastModified);
 
				       scheduling tasks and monitoring them, providing status and diagnostic 
			
 
				       information to the job-client.</p>
			
 
				 <p>Although the Hadoop framework is implemented in Java<sup>TM</sup>, 
			
 
				-      Map-Reduce applications need not be written in Java.</p>
			
 
				+      Map/Reduce applications need not be written in Java.</p>
			
 
				 <ul>
			
 
				         
			
 
				 <li>
			
@@ -403,7 +403,7 @@ document.write("Last Published: " + document.lastModified);
 
				           
			
 
				 <a href="api/org/apache/hadoop/mapred/pipes/package-summary.html">
			
 
				           Hadoop Pipes</a> is a <a href="http://www.swig.org/">SWIG</a>-
			
 
				-          compatible <em>C++ API</em> to implement Map-Reduce applications (non 
			
 
				+          compatible <em>C++ API</em> to implement Map/Reduce applications (non 
			
 
				           JNI<sup>TM</sup> based).
			
 
				         </li>
			
 
				       
			
@@ -414,7 +414,7 @@ document.write("Last Published: " + document.lastModified);
 
				 <a name="N1008B"></a><a name="Inputs+and+Outputs"></a>
			
 
				 <h2 class="h3">Inputs and Outputs</h2>
			
 
				 <div class="section">
			
 
				-<p>The Map-Reduce framework operates exclusively on 
			
 
				+<p>The Map/Reduce framework operates exclusively on 
			
 
				       <span class="codefrag">&lt;key, value&gt;</span> pairs, that is, the framework views the 
			
 
				       input to the job as a set of <span class="codefrag">&lt;key, value&gt;</span> pairs and 
			
 
				       produces a set of <span class="codefrag">&lt;key, value&gt;</span> pairs as the output of 
			
@@ -426,7 +426,7 @@ document.write("Last Published: " + document.lastModified);
 
				       <a href="api/org/apache/hadoop/io/WritableComparable.html">
			
 
				       WritableComparable</a> interface to facilitate sorting by the framework.
			
 
				       </p>
			
 
				-<p>Input and Output types of a Map-Reduce job:</p>
			
 
				+<p>Input and Output types of a Map/Reduce job:</p>
			
 
				 <p>
			
 
				         (input) <span class="codefrag">&lt;k1, v1&gt;</span> 
			
 
				         -&gt; 
			
@@ -448,7 +448,7 @@ document.write("Last Published: " + document.lastModified);
 
				 <a name="N100CD"></a><a name="Example%3A+WordCount+v1.0"></a>
			
 
				 <h2 class="h3">Example: WordCount v1.0</h2>
			
 
				 <div class="section">
			
 
				-<p>Before we jump into the details, lets walk through an example Map-Reduce 
			
 
				+<p>Before we jump into the details, lets walk through an example Map/Reduce 
			
 
				       application to get a flavour for how they work.</p>
			
 
				 <p>
			
 
				 <span class="codefrag">WordCount</span> is a simple application that counts the number of
			
@@ -1226,11 +1226,11 @@ document.write("Last Published: " + document.lastModified);
 
				 </div>
			
 
				     
			
 
				     
			
 
				-<a name="N105A3"></a><a name="Map-Reduce+-+User+Interfaces"></a>
			
 
				-<h2 class="h3">Map-Reduce - User Interfaces</h2>
			
 
				+<a name="N105A3"></a><a name="Map%2FReduce+-+User+Interfaces"></a>
			
 
				+<h2 class="h3">Map/Reduce - User Interfaces</h2>
			
 
				 <div class="section">
			
 
				 <p>This section provides a reasonable amount of detail on every user-facing 
			
 
				-      aspect of the Map-Reduce framwork. This should help users implement, 
			
 
				+      aspect of the Map/Reduce framwork. This should help users implement, 
			
 
				       configure and tune their jobs in a fine-grained manner. However, please 
			
 
				       note that the javadoc for each class/interface remains the most 
			
 
				       comprehensive documentation available; this is only meant to be a tutorial.
			
@@ -1260,7 +1260,7 @@ document.write("Last Published: " + document.lastModified);
 
				           intermediate records. The transformed intermediate records do not need
			
 
				           to be of the same type as the input records. A given input pair may 
			
 
				           map to zero or many output pairs.</p>
			
 
				-<p>The Hadoop Map-Reduce framework spawns one map task for each 
			
 
				+<p>The Hadoop Map/Reduce framework spawns one map task for each 
			
 
				           <span class="codefrag">InputSplit</span> generated by the <span class="codefrag">InputFormat</span> for 
			
 
				           the job.</p>
			
 
				 <p>Overall, <span class="codefrag">Mapper</span> implementations are passed the 
			
@@ -1423,7 +1423,7 @@ document.write("Last Published: " + document.lastModified);
 
				 <h4>Reporter</h4>
			
 
				 <p>
			
 
				 <a href="api/org/apache/hadoop/mapred/Reporter.html">
			
 
				-          Reporter</a> is a facility for Map-Reduce applications to report 
			
 
				+          Reporter</a> is a facility for Map/Reduce applications to report 
			
 
				           progress, set application-level status messages and update 
			
 
				           <span class="codefrag">Counters</span>.</p>
			
 
				 <p>
			
@@ -1443,20 +1443,20 @@ document.write("Last Published: " + document.lastModified);
 
				 <p>
			
 
				 <a href="api/org/apache/hadoop/mapred/OutputCollector.html">
			
 
				           OutputCollector</a> is a generalization of the facility provided by
			
 
				-          the Map-Reduce framework to collect data output by the 
			
 
				+          the Map/Reduce framework to collect data output by the 
			
 
				           <span class="codefrag">Mapper</span> or the <span class="codefrag">Reducer</span> (either the 
			
 
				           intermediate outputs or the output of the job).</p>
			
 
				-<p>Hadoop Map-Reduce comes bundled with a 
			
 
				+<p>Hadoop Map/Reduce comes bundled with a 
			
 
				         <a href="api/org/apache/hadoop/mapred/lib/package-summary.html">
			
 
				         library</a> of generally useful mappers, reducers, and partitioners.</p>
			
 
				 <a name="N107B6"></a><a name="Job+Configuration"></a>
			
 
				 <h3 class="h4">Job Configuration</h3>
			
 
				 <p>
			
 
				 <a href="api/org/apache/hadoop/mapred/JobConf.html">
			
 
				-        JobConf</a> represents a Map-Reduce job configuration.</p>
			
 
				+        JobConf</a> represents a Map/Reduce job configuration.</p>
			
 
				 <p>
			
 
				 <span class="codefrag">JobConf</span> is the primary interface for a user to describe
			
 
				-        a map-reduce job to the Hadoop framework for execution. The framework 
			
 
				+        a Map/Reduce job to the Hadoop framework for execution. The framework 
			
 
				         tries to faithfully execute the job as described by <span class="codefrag">JobConf</span>, 
			
 
				         however:</p>
			
 
				 <ul>
			
@@ -1747,7 +1747,7 @@ document.write("Last Published: " + document.lastModified);
 
				         with the <span class="codefrag">JobTracker</span>.</p>
			
 
				 <p>
			
 
				 <span class="codefrag">JobClient</span> provides facilities to submit jobs, track their 
			
 
				-        progress, access component-tasks' reports/logs, get the Map-Reduce 
			
 
				+        progress, access component-tasks' reports and logs, get the Map/Reduce 
			
 
				         cluster's status information and so on.</p>
			
 
				 <p>The job submission process involves:</p>
			
 
				 <ol>
			
@@ -1762,7 +1762,7 @@ document.write("Last Published: " + document.lastModified);
 
				           </li>
			
 
				           
			
 
				 <li>
			
 
				-            Copying the job's jar and configuration to the map-reduce system 
			
 
				+            Copying the job's jar and configuration to the Map/Reduce system 
			
 
				             directory on the <span class="codefrag">FileSystem</span>.
			
 
				           </li>
			
 
				           
			
@@ -1802,8 +1802,8 @@ document.write("Last Published: " + document.lastModified);
 
				         <span class="codefrag">JobClient</span> to submit the job and monitor its progress.</p>
			
 
				 <a name="N10A48"></a><a name="Job+Control"></a>
			
 
				 <h4>Job Control</h4>
			
 
				-<p>Users may need to chain map-reduce jobs to accomplish complex
			
 
				-          tasks which cannot be done via a single map-reduce job. This is fairly
			
 
				+<p>Users may need to chain Map/Reduce jobs to accomplish complex
			
 
				+          tasks which cannot be done via a single Map/Reduce job. This is fairly
			
 
				           easy since the output of the job typically goes to distributed 
			
 
				           file-system, and the output, in turn, can be used as the input for the 
			
 
				           next job.</p>
			
@@ -1840,9 +1840,9 @@ document.write("Last Published: " + document.lastModified);
 
				 <h3 class="h4">Job Input</h3>
			
 
				 <p>
			
 
				 <a href="api/org/apache/hadoop/mapred/InputFormat.html">
			
 
				-        InputFormat</a> describes the input-specification for a Map-Reduce job.
			
 
				+        InputFormat</a> describes the input-specification for a Map/Reduce job.
			
 
				         </p>
			
 
				-<p>The Map-Reduce framework relies on the <span class="codefrag">InputFormat</span> of 
			
 
				+<p>The Map/Reduce framework relies on the <span class="codefrag">InputFormat</span> of 
			
 
				         the job to:</p>
			
 
				 <ol>
			
 
				           
			
@@ -1914,9 +1914,9 @@ document.write("Last Published: " + document.lastModified);
 
				 <h3 class="h4">Job Output</h3>
			
 
				 <p>
			
 
				 <a href="api/org/apache/hadoop/mapred/OutputFormat.html">
			
 
				-        OutputFormat</a> describes the output-specification for a Map-Reduce 
			
 
				+        OutputFormat</a> describes the output-specification for a Map/Reduce 
			
 
				         job.</p>
			
 
				-<p>The Map-Reduce framework relies on the <span class="codefrag">OutputFormat</span> of 
			
 
				+<p>The Map/Reduce framework relies on the <span class="codefrag">OutputFormat</span> of 
			
 
				         the job to:</p>
			
 
				 <ol>
			
 
				           
			
@@ -1946,7 +1946,7 @@ document.write("Last Published: " + document.lastModified);
 
				           application-writer will have to pick unique names per task-attempt 
			
 
				           (using the attemptid, say <span class="codefrag">attempt_200709221812_0001_m_000000_0</span>), 
			
 
				           not just per task.</p>
			
 
				-<p>To avoid these issues the Map-Reduce framework maintains a special 
			
 
				+<p>To avoid these issues the Map/Reduce framework maintains a special 
			
 
				           <span class="codefrag">${mapred.output.dir}/_temporary/_${taskid}</span> sub-directory
			
 
				           accessible via <span class="codefrag">${mapred.work.output.dir}</span>
			
 
				           for each task-attempt on the <span class="codefrag">FileSystem</span> where the output
			
@@ -1966,7 +1966,7 @@ document.write("Last Published: " + document.lastModified);
 
				 <p>Note: The value of <span class="codefrag">${mapred.work.output.dir}</span> during 
			
 
				           execution of a particular task-attempt is actually 
			
 
				           <span class="codefrag">${mapred.output.dir}/_temporary/_{$taskid}</span>, and this value is 
			
 
				-          set by the map-reduce framework. So, just create any side-files in the 
			
 
				+          set by the Map/Reduce framework. So, just create any side-files in the 
			
 
				           path  returned by
			
 
				           <a href="api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)">
			
 
				           FileOutputFormat.getWorkOutputPath() </a>from map/reduce 
			
@@ -1988,7 +1988,7 @@ document.write("Last Published: " + document.lastModified);
 
				 <h4>Counters</h4>
			
 
				 <p>
			
 
				 <span class="codefrag">Counters</span> represent global counters, defined either by 
			
 
				-          the Map-Reduce framework or applications. Each <span class="codefrag">Counter</span> can 
			
 
				+          the Map/Reduce framework or applications. Each <span class="codefrag">Counter</span> can 
			
 
				           be of any <span class="codefrag">Enum</span> type. Counters of a particular 
			
 
				           <span class="codefrag">Enum</span> are bunched into groups of type 
			
 
				           <span class="codefrag">Counters.Group</span>.</p>
			
@@ -2009,7 +2009,7 @@ document.write("Last Published: " + document.lastModified);
 
				           files efficiently.</p>
			
 
				 <p>
			
 
				 <span class="codefrag">DistributedCache</span> is a facility provided by the 
			
 
				-          Map-Reduce framework to cache files (text, archives, jars and so on) 
			
 
				+          Map/Reduce framework to cache files (text, archives, jars and so on) 
			
 
				           needed by applications.</p>
			
 
				 <p>Applications specify the files to be cached via urls (hdfs://)
			
 
				           in the <span class="codefrag">JobConf</span>. The <span class="codefrag">DistributedCache</span> 
			
@@ -2078,7 +2078,7 @@ document.write("Last Published: " + document.lastModified);
 
				           interface supports the handling of generic Hadoop command-line options.
			
 
				           </p>
			
 
				 <p>
			
 
				-<span class="codefrag">Tool</span> is the standard for any Map-Reduce tool or 
			
 
				+<span class="codefrag">Tool</span> is the standard for any Map/Reduce tool or 
			
 
				           application. The application should delegate the handling of 
			
 
				           standard command-line options to 
			
 
				           <a href="api/org/apache/hadoop/util/GenericOptionsParser.html">
			
@@ -2116,7 +2116,7 @@ document.write("Last Published: " + document.lastModified);
 
				 <h4>IsolationRunner</h4>
			
 
				 <p>
			
 
				 <a href="api/org/apache/hadoop/mapred/IsolationRunner.html">
			
 
				-          IsolationRunner</a> is a utility to help debug Map-Reduce programs.</p>
			
 
				+          IsolationRunner</a> is a utility to help debug Map/Reduce programs.</p>
			
 
				 <p>To use the <span class="codefrag">IsolationRunner</span>, first set 
			
 
				           <span class="codefrag">keep.failed.tasks.files</span> to <span class="codefrag">true</span> 
			
 
				           (also see <span class="codefrag">keep.tasks.files.pattern</span>).</p>
			
@@ -2219,11 +2219,11 @@ document.write("Last Published: " + document.lastModified);
 
				 <h4>JobControl</h4>
			
 
				 <p>
			
 
				 <a href="api/org/apache/hadoop/mapred/jobcontrol/package-summary.html">
			
 
				-          JobControl</a> is a utility which encapsulates a set of Map-Reduce jobs
			
 
				+          JobControl</a> is a utility which encapsulates a set of Map/Reduce jobs
			
 
				           and their dependencies.</p>
			
 
				 <a name="N10D57"></a><a name="Data+Compression"></a>
			
 
				 <h4>Data Compression</h4>
			
 
				-<p>Hadoop Map-Reduce provides facilities for the application-writer to
			
 
				+<p>Hadoop Map/Reduce provides facilities for the application-writer to
			
 
				           specify compression for both intermediate map-outputs and the
			
 
				           job-outputs i.e. output of the reduces. It also comes bundled with
			
 
				           <a href="api/org/apache/hadoop/io/compress/CompressionCodec.html">
			
@@ -2268,7 +2268,7 @@ document.write("Last Published: " + document.lastModified);
 
				 <h2 class="h3">Example: WordCount v2.0</h2>
			
 
				 <div class="section">
			
 
				 <p>Here is a more complete <span class="codefrag">WordCount</span> which uses many of the
			
 
				-      features provided by the Map-Reduce framework we discussed so far.</p>
			
 
				+      features provided by the Map/Reduce framework we discussed so far.</p>
			
 
				 <p>This needs the HDFS to be up and running, especially for the 
			
 
				       <span class="codefrag">DistributedCache</span>-related features. Hence it only works with a 
			
 
				       <a href="quickstart.html#SingleNodeSetup">pseudo-distributed</a> or
			
@@ -3655,7 +3655,7 @@ document.write("Last Published: " + document.lastModified);
 
				 <a name="N1160B"></a><a name="Highlights"></a>
			
 
				 <h3 class="h4">Highlights</h3>
			
 
				 <p>The second version of <span class="codefrag">WordCount</span> improves upon the 
			
 
				-        previous one by using some features offered by the Map-Reduce framework:
			
 
				+        previous one by using some features offered by the Map/Reduce framework:
			
 
				         </p>
			
 
				 <ul>
			
 
				           
			
--- a/docs/mapred_tutorial.pdf
+++ b/docs/mapred_tutorial.pdf
--- a/docs/streaming.html
+++ b/docs/streaming.html
@@ -287,7 +287,7 @@ document.write("Last Published: " + document.lastModified);
 
				 <h2 class="h3">Hadoop Streaming</h2>
			
 
				 <div class="section">
			
 
				 <p>
			
 
				-Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer. For example:
			
 
				+Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. For example:
			
 
				 </p>
			
 
				 <pre class="code">
			
 
				 $HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
			
@@ -303,7 +303,7 @@ $HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
 
				 <h2 class="h3">How Does Streaming Work </h2>
			
 
				 <div class="section">
			
 
				 <p>
			
 
				-In the above example, both the mapper and the reducer are executables that read the input from stdin (line by line) and emit the output to stdout. The utility will create a map/reduce job, submit the job to an appropriate cluster, and monitor the progress of the job until it completes.
			
 
				+In the above example, both the mapper and the reducer are executables that read the input from stdin (line by line) and emit the output to stdout. The utility will create a Map/Reduce job, submit the job to an appropriate cluster, and monitor the progress of the job until it completes.
			
 
				 </p>
			
 
				 <p>
			
 
				   When an executable is specified for mappers, each mapper task will launch the executable as a separate process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feed the lines to the stdin of the process. In the meantime, the mapper collects the line oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the 
			
@@ -314,7 +314,7 @@ In the above example, both the mapper and the reducer are executables that read
 
				 When an executable is specified for reducers, each reducer task will launch the executable as a separate process then the reducer is initialized. As the reducer task runs, it converts its input key/values pairs into lines and feeds the lines to the stdin of the process. In the meantime, the reducer collects the line oriented outputs from the stdout of the process, converts each line into a key/value pair, which is collected as the output of the reducer. By default, the prefix of a line up to the first tab character is the key and the the rest of the line (excluding the tab character) is the value. However, this can be customized, as discussed later.
			
 
				 </p>
			
 
				 <p>
			
 
				-This is the basis for the communication protocol between the map/reduce framework and the streaming mapper/reducer.
			
 
				+This is the basis for the communication protocol between the Map/Reduce framework and the streaming mapper/reducer.
			
 
				 </p>
			
 
				 <p>
			
 
				 You can supply a Java class as the mapper and/or the reducer. The above example is equivalent to:
			
@@ -372,7 +372,7 @@ $HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
 
				 <a name="N10077"></a><a name="Mapper-Only+Jobs"></a>
			
 
				 <h3 class="h4">Mapper-Only Jobs </h3>
			
 
				 <p>
			
 
				-Often, you may want to process input data using a map function only. To do this, simply set mapred.reduce.tasks to zero. The map/reduce framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job.
			
 
				+Often, you may want to process input data using a map function only. To do this, simply set mapred.reduce.tasks to zero. The Map/Reduce framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job.
			
 
				 </p>
			
 
				 <p>
			
 
				 To be backward compatible, Hadoop Streaming also supports the "-reduce NONE" option, which is equivalent to "-jobconf mapred.reduce.tasks=0".
			
@@ -380,7 +380,7 @@ To be backward compatible, Hadoop Streaming also supports the "-reduce NONE" opt
 
				 <a name="N10083"></a><a name="Specifying+Other+Plugins+for+Jobs"></a>
			
 
				 <h3 class="h4">Specifying Other Plugins for Jobs </h3>
			
 
				 <p>
			
 
				-Just as with a normal map/reduce job, you can specify other plugins for a streaming job:
			
 
				+Just as with a normal Map/Reduce job, you can specify other plugins for a streaming job:
			
 
				 </p>
			
 
				 <pre class="code">
			
 
				    -inputformat JavaClassName
			
@@ -500,7 +500,7 @@ Other options you may specify for a streaming job are described here:
 
				 
			
 
				 
			
 
				 <tr>
			
 
				-<td colspan="1" rowspan="1"> -dfs  host:port or local </td><td colspan="1" rowspan="1"> Optional </td><td colspan="1" rowspan="1"> Override the DFS configuration for the job </td>
			
 
				+<td colspan="1" rowspan="1"> -dfs  host:port or local </td><td colspan="1" rowspan="1"> Optional </td><td colspan="1" rowspan="1"> Override the HDFS configuration for the job </td>
			
 
				 </tr>
			
 
				 
			
 
				 <tr>
			
@@ -571,7 +571,7 @@ To set an environment variable in a streaming command use:
 
				 <a name="N10194"></a><a name="Customizing+the+Way+to+Split+Lines+into+Key%2FValue+Pairs"></a>
			
 
				 <h3 class="h4">Customizing the Way to Split Lines into Key/Value Pairs </h3>
			
 
				 <p>
			
 
				-As noted earlier, when the map/reduce framework reads a line from the stdout of the mapper, it splits the line into a key/value pair. By default, the prefix of the line up to the first tab character is the key and the the rest of the line (excluding the tab character) is the value.
			
 
				+As noted earlier, when the Map/Reduce framework reads a line from the stdout of the mapper, it splits the line into a key/value pair. By default, the prefix of the line up to the first tab character is the key and the the rest of the line (excluding the tab character) is the value.
			
 
				 </p>
			
 
				 <p>
			
 
				 However, you can customize this default. You can specify a field separator other than the tab character (the default), and you can specify the nth (n &gt;= 1) character rather than the first character in a line (the default) as the separator between the key and value. For example:
			
@@ -594,7 +594,7 @@ Similarly, you can use "-jobconf stream.reduce.output.field.separator=SEP" and "
 
				 <a name="N101AA"></a><a name="A+Useful+Partitioner+Class+%28secondary+sort%2C+the+-partitioner+org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner+option%29"></a>
			
 
				 <h3 class="h4">A Useful Partitioner Class (secondary sort, the -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner option) </h3>
			
 
				 <p>
			
 
				-Hadoop has a library class, org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner, that is useful for many applications. This class allows the map/reduce framework to partition the map outputs based on prefixes of keys, not the whole keys. For example:
			
 
				+Hadoop has a library class, org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner, that is useful for many applications. This class allows the Map/Reduce framework to partition the map outputs based on prefixes of keys, not the whole keys. For example:
			
 
				 </p>
			
 
				 <pre class="code">
			
 
				 $HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
			
@@ -613,7 +613,7 @@ $HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
 
				 Here, <em>-jobconf stream.map.output.field.separator=.</em> and <em>-jobconf stream.num.map.output.key.fields=4</em> are as explained in previous example. The two variables are used by streaming to identify the key/value pair of mapper. 
			
 
				 </p>
			
 
				 <p>
			
 
				-The map output keys of the above map/reduce job normally have four fields separated by ".". However, the map/reduce framework will partition the map outputs by the first two fields of the keys using the <em>-jobconf num.key.fields.for.partition=2</em> option. Here, <em>-jobconf map.output.key.field.separator=.</em> specifies the separator for the partition. This guarantees that all the key/value pairs with the same first two fields in the keys will be partitioned into the same reducer.
			
 
				+The map output keys of the above Map/Reduce job normally have four fields separated by ".". However, the Map/Reduce framework will partition the map outputs by the first two fields of the keys using the <em>-jobconf num.key.fields.for.partition=2</em> option. Here, <em>-jobconf map.output.key.field.separator=.</em> specifies the separator for the partition. This guarantees that all the key/value pairs with the same first two fields in the keys will be partitioned into the same reducer.
			
 
				 </p>
			
 
				 <p>
			
 
				 
			
@@ -746,7 +746,7 @@ As an example, consider the problem of zipping (compressing) a set of files acro
 
				 
			
 
				 <li> Hadoop Streaming and custom mapper script:<ul>
			
 
				   
			
 
				-<li> Generate a file containing the full DFS path of the input files. Each map task would get one file name as input.</li>
			
 
				+<li> Generate a file containing the full HDFS path of the input files. Each map task would get one file name as input.</li>
			
 
				   
			
 
				 <li> Create a mapper script which, given a filename, will get the file to local disk, gzip the file and put it back in the desired output directory</li>
			
 
				 
			
--- a/docs/streaming.pdf
+++ b/docs/streaming.pdf
--- a/src/docs/src/documentation/content/xdocs/mapred_tutorial.xml
+++ b/src/docs/src/documentation/content/xdocs/mapred_tutorial.xml
@@ -20,7 +20,7 @@
 
				 <document>
			
 
				   
			
 
				   <header>
			
 
				-    <title>Hadoop Map-Reduce Tutorial</title>
			
 
				+    <title>Hadoop Map/Reduce Tutorial</title>
			
 
				   </header>
			
 
				   
			
 
				   <body>
			
@@ -29,7 +29,7 @@
 
				       <title>Purpose</title>
			
 
				       
			
 
				       <p>This document comprehensively describes all user-facing facets of the 
			
 
				-      Hadoop Map-Reduce framework and serves as a tutorial.
			
 
				+      Hadoop Map/Reduce framework and serves as a tutorial.
			
 
				       </p>
			
 
				     </section>
			
 
				     
			
@@ -52,12 +52,12 @@
 
				     <section>
			
 
				       <title>Overview</title>
			
 
				       
			
 
				-      <p>Hadoop Map-Reduce is a software framework for easily writing 
			
 
				+      <p>Hadoop Map/Reduce is a software framework for easily writing 
			
 
				       applications which process vast amounts of data (multi-terabyte data-sets) 
			
 
				       in-parallel on large clusters (thousands of nodes) of commodity 
			
 
				       hardware in a reliable, fault-tolerant manner.</p>
			
 
				       
			
 
				-      <p>A Map-Reduce <em>job</em> usually splits the input data-set into 
			
 
				+      <p>A Map/Reduce <em>job</em> usually splits the input data-set into 
			
 
				       independent chunks which are processed by the <em>map tasks</em> in a
			
 
				       completely parallel manner. The framework sorts the outputs of the maps, 
			
 
				       which are then input to the <em>reduce tasks</em>. Typically both the 
			
@@ -66,13 +66,13 @@
 
				       tasks.</p>
			
 
				       
			
 
				       <p>Typically the compute nodes and the storage nodes are the same, that is, 
			
 
				-      the Map-Reduce framework and the <a href="hdfs_design.html">Distributed 
			
 
				+      the Map/Reduce framework and the <a href="hdfs_design.html">Distributed 
			
 
				       FileSystem</a> are running on the same set of nodes. This configuration
			
 
				       allows the framework to effectively schedule tasks on the nodes where data 
			
 
				       is already present, resulting in very high aggregate bandwidth across the 
			
 
				       cluster.</p>
			
 
				       
			
 
				-      <p>The Map-Reduce framework consists of a single master 
			
 
				+      <p>The Map/Reduce framework consists of a single master 
			
 
				       <code>JobTracker</code> and one slave <code>TaskTracker</code> per 
			
 
				       cluster-node. The master is responsible for scheduling the jobs' component 
			
 
				       tasks on the slaves, monitoring them and re-executing the failed tasks. The 
			
@@ -89,7 +89,7 @@
 
				       information to the job-client.</p>
			
 
				       
			
 
				       <p>Although the Hadoop framework is implemented in Java<sup>TM</sup>, 
			
 
				-      Map-Reduce applications need not be written in Java.</p>
			
 
				+      Map/Reduce applications need not be written in Java.</p>
			
 
				       <ul>
			
 
				         <li>
			
 
				           <a href="ext:api/org/apache/hadoop/streaming/package-summary">
			
@@ -100,7 +100,7 @@
 
				         <li>
			
 
				           <a href="ext:api/org/apache/hadoop/mapred/pipes/package-summary">
			
 
				           Hadoop Pipes</a> is a <a href="http://www.swig.org/">SWIG</a>-
			
 
				-          compatible <em>C++ API</em> to implement Map-Reduce applications (non 
			
 
				+          compatible <em>C++ API</em> to implement Map/Reduce applications (non 
			
 
				           JNI<sup>TM</sup> based).
			
 
				         </li>
			
 
				       </ul>
			
@@ -109,7 +109,7 @@
 
				     <section>
			
 
				       <title>Inputs and Outputs</title>
			
 
				 
			
 
				-      <p>The Map-Reduce framework operates exclusively on 
			
 
				+      <p>The Map/Reduce framework operates exclusively on 
			
 
				       <code>&lt;key, value&gt;</code> pairs, that is, the framework views the 
			
 
				       input to the job as a set of <code>&lt;key, value&gt;</code> pairs and 
			
 
				       produces a set of <code>&lt;key, value&gt;</code> pairs as the output of 
			
@@ -123,7 +123,7 @@
 
				       WritableComparable</a> interface to facilitate sorting by the framework.
			
 
				       </p>
			
 
				 
			
 
				-      <p>Input and Output types of a Map-Reduce job:</p>
			
 
				+      <p>Input and Output types of a Map/Reduce job:</p>
			
 
				       <p>
			
 
				         (input) <code>&lt;k1, v1&gt;</code> 
			
 
				         -&gt; 
			
@@ -144,7 +144,7 @@
 
				     <section>
			
 
				       <title>Example: WordCount v1.0</title>
			
 
				       
			
 
				-      <p>Before we jump into the details, lets walk through an example Map-Reduce 
			
 
				+      <p>Before we jump into the details, lets walk through an example Map/Reduce 
			
 
				       application to get a flavour for how they work.</p>
			
 
				       
			
 
				       <p><code>WordCount</code> is a simple application that counts the number of
			
@@ -683,10 +683,10 @@
 
				     </section>
			
 
				     
			
 
				     <section>
			
 
				-      <title>Map-Reduce - User Interfaces</title>
			
 
				+      <title>Map/Reduce - User Interfaces</title>
			
 
				       
			
 
				       <p>This section provides a reasonable amount of detail on every user-facing 
			
 
				-      aspect of the Map-Reduce framwork. This should help users implement, 
			
 
				+      aspect of the Map/Reduce framwork. This should help users implement, 
			
 
				       configure and tune their jobs in a fine-grained manner. However, please 
			
 
				       note that the javadoc for each class/interface remains the most 
			
 
				       comprehensive documentation available; this is only meant to be a tutorial.
			
@@ -724,7 +724,7 @@
 
				           to be of the same type as the input records. A given input pair may 
			
 
				           map to zero or many output pairs.</p> 
			
 
				  
			
 
				-          <p>The Hadoop Map-Reduce framework spawns one map task for each 
			
 
				+          <p>The Hadoop Map/Reduce framework spawns one map task for each 
			
 
				           <code>InputSplit</code> generated by the <code>InputFormat</code> for 
			
 
				           the job.</p>
			
 
				           
			
@@ -935,7 +935,7 @@
 
				           <title>Reporter</title>
			
 
				         
			
 
				           <p><a href="ext:api/org/apache/hadoop/mapred/reporter">
			
 
				-          Reporter</a> is a facility for Map-Reduce applications to report 
			
 
				+          Reporter</a> is a facility for Map/Reduce applications to report 
			
 
				           progress, set application-level status messages and update 
			
 
				           <code>Counters</code>.</p>
			
 
				  
			
@@ -958,12 +958,12 @@
 
				         
			
 
				           <p><a href="ext:api/org/apache/hadoop/mapred/outputcollector">
			
 
				           OutputCollector</a> is a generalization of the facility provided by
			
 
				-          the Map-Reduce framework to collect data output by the 
			
 
				+          the Map/Reduce framework to collect data output by the 
			
 
				           <code>Mapper</code> or the <code>Reducer</code> (either the 
			
 
				           intermediate outputs or the output of the job).</p>
			
 
				         </section>
			
 
				       
			
 
				-        <p>Hadoop Map-Reduce comes bundled with a 
			
 
				+        <p>Hadoop Map/Reduce comes bundled with a 
			
 
				         <a href="ext:api/org/apache/hadoop/mapred/lib/package-summary">
			
 
				         library</a> of generally useful mappers, reducers, and partitioners.</p>
			
 
				       </section>
			
@@ -972,10 +972,10 @@
 
				         <title>Job Configuration</title>
			
 
				         
			
 
				         <p><a href="ext:api/org/apache/hadoop/mapred/jobconf">
			
 
				-        JobConf</a> represents a Map-Reduce job configuration.</p>
			
 
				+        JobConf</a> represents a Map/Reduce job configuration.</p>
			
 
				  
			
 
				         <p><code>JobConf</code> is the primary interface for a user to describe
			
 
				-        a map-reduce job to the Hadoop framework for execution. The framework 
			
 
				+        a Map/Reduce job to the Hadoop framework for execution. The framework 
			
 
				         tries to faithfully execute the job as described by <code>JobConf</code>, 
			
 
				         however:</p> 
			
 
				         <ul>
			
@@ -1204,7 +1204,7 @@
 
				         with the <code>JobTracker</code>.</p>
			
 
				  
			
 
				         <p><code>JobClient</code> provides facilities to submit jobs, track their 
			
 
				-        progress, access component-tasks' reports/logs, get the Map-Reduce 
			
 
				+        progress, access component-tasks' reports and logs, get the Map/Reduce 
			
 
				         cluster's status information and so on.</p>
			
 
				  
			
 
				         <p>The job submission process involves:</p>
			
@@ -1216,7 +1216,7 @@
 
				             <code>DistributedCache</code> of the job, if necessary.
			
 
				           </li>
			
 
				           <li>
			
 
				-            Copying the job's jar and configuration to the map-reduce system 
			
 
				+            Copying the job's jar and configuration to the Map/Reduce system 
			
 
				             directory on the <code>FileSystem</code>.
			
 
				           </li>
			
 
				           <li>
			
@@ -1253,8 +1253,8 @@
 
				         <section>
			
 
				           <title>Job Control</title>
			
 
				  
			
 
				-          <p>Users may need to chain map-reduce jobs to accomplish complex
			
 
				-          tasks which cannot be done via a single map-reduce job. This is fairly
			
 
				+          <p>Users may need to chain Map/Reduce jobs to accomplish complex
			
 
				+          tasks which cannot be done via a single Map/Reduce job. This is fairly
			
 
				           easy since the output of the job typically goes to distributed 
			
 
				           file-system, and the output, in turn, can be used as the input for the 
			
 
				           next job.</p>
			
@@ -1288,10 +1288,10 @@
 
				         <title>Job Input</title>
			
 
				         
			
 
				         <p><a href="ext:api/org/apache/hadoop/mapred/inputformat">
			
 
				-        InputFormat</a> describes the input-specification for a Map-Reduce job.
			
 
				+        InputFormat</a> describes the input-specification for a Map/Reduce job.
			
 
				         </p> 
			
 
				  
			
 
				-        <p>The Map-Reduce framework relies on the <code>InputFormat</code> of 
			
 
				+        <p>The Map/Reduce framework relies on the <code>InputFormat</code> of 
			
 
				         the job to:</p>
			
 
				         <ol>
			
 
				           <li>Validate the input-specification of the job.</li>
			
@@ -1370,10 +1370,10 @@
 
				         <title>Job Output</title>
			
 
				         
			
 
				         <p><a href="ext:api/org/apache/hadoop/mapred/outputformat">
			
 
				-        OutputFormat</a> describes the output-specification for a Map-Reduce 
			
 
				+        OutputFormat</a> describes the output-specification for a Map/Reduce 
			
 
				         job.</p>
			
 
				 
			
 
				-        <p>The Map-Reduce framework relies on the <code>OutputFormat</code> of 
			
 
				+        <p>The Map/Reduce framework relies on the <code>OutputFormat</code> of 
			
 
				         the job to:</p>
			
 
				         <ol>
			
 
				           <li>
			
@@ -1404,7 +1404,7 @@
 
				           (using the attemptid, say <code>attempt_200709221812_0001_m_000000_0</code>), 
			
 
				           not just per task.</p> 
			
 
				  
			
 
				-          <p>To avoid these issues the Map-Reduce framework maintains a special 
			
 
				+          <p>To avoid these issues the Map/Reduce framework maintains a special 
			
 
				           <code>${mapred.output.dir}/_temporary/_${taskid}</code> sub-directory
			
 
				           accessible via <code>${mapred.work.output.dir}</code>
			
 
				           for each task-attempt on the <code>FileSystem</code> where the output
			
@@ -1426,7 +1426,7 @@
 
				           <p>Note: The value of <code>${mapred.work.output.dir}</code> during 
			
 
				           execution of a particular task-attempt is actually 
			
 
				           <code>${mapred.output.dir}/_temporary/_{$taskid}</code>, and this value is 
			
 
				-          set by the map-reduce framework. So, just create any side-files in the 
			
 
				+          set by the Map/Reduce framework. So, just create any side-files in the 
			
 
				           path  returned by
			
 
				           <a href="ext:api/org/apache/hadoop/mapred/fileoutputformat/getworkoutputpath">
			
 
				           FileOutputFormat.getWorkOutputPath() </a>from map/reduce 
			
@@ -1456,7 +1456,7 @@
 
				           <title>Counters</title>
			
 
				           
			
 
				           <p><code>Counters</code> represent global counters, defined either by 
			
 
				-          the Map-Reduce framework or applications. Each <code>Counter</code> can 
			
 
				+          the Map/Reduce framework or applications. Each <code>Counter</code> can 
			
 
				           be of any <code>Enum</code> type. Counters of a particular 
			
 
				           <code>Enum</code> are bunched into groups of type 
			
 
				           <code>Counters.Group</code>.</p>
			
@@ -1480,7 +1480,7 @@
 
				           files efficiently.</p>
			
 
				  
			
 
				           <p><code>DistributedCache</code> is a facility provided by the 
			
 
				-          Map-Reduce framework to cache files (text, archives, jars and so on) 
			
 
				+          Map/Reduce framework to cache files (text, archives, jars and so on) 
			
 
				           needed by applications.</p>
			
 
				  
			
 
				           <p>Applications specify the files to be cached via urls (hdfs://)
			
@@ -1558,7 +1558,7 @@
 
				           interface supports the handling of generic Hadoop command-line options.
			
 
				           </p>
			
 
				           
			
 
				-          <p><code>Tool</code> is the standard for any Map-Reduce tool or 
			
 
				+          <p><code>Tool</code> is the standard for any Map/Reduce tool or 
			
 
				           application. The application should delegate the handling of 
			
 
				           standard command-line options to 
			
 
				           <a href="ext:api/org/apache/hadoop/util/genericoptionsparser">
			
@@ -1591,7 +1591,7 @@
 
				           <title>IsolationRunner</title>
			
 
				           
			
 
				           <p><a href="ext:api/org/apache/hadoop/mapred/isolationrunner">
			
 
				-          IsolationRunner</a> is a utility to help debug Map-Reduce programs.</p>
			
 
				+          IsolationRunner</a> is a utility to help debug Map/Reduce programs.</p>
			
 
				           
			
 
				           <p>To use the <code>IsolationRunner</code>, first set 
			
 
				           <code>keep.failed.tasks.files</code> to <code>true</code> 
			
@@ -1703,14 +1703,14 @@
 
				           <title>JobControl</title>
			
 
				           
			
 
				           <p><a href="ext:api/org/apache/hadoop/mapred/jobcontrol/package-summary">
			
 
				-          JobControl</a> is a utility which encapsulates a set of Map-Reduce jobs
			
 
				+          JobControl</a> is a utility which encapsulates a set of Map/Reduce jobs
			
 
				           and their dependencies.</p>
			
 
				         </section>
			
 
				         
			
 
				         <section>
			
 
				           <title>Data Compression</title>
			
 
				           
			
 
				-          <p>Hadoop Map-Reduce provides facilities for the application-writer to
			
 
				+          <p>Hadoop Map/Reduce provides facilities for the application-writer to
			
 
				           specify compression for both intermediate map-outputs and the
			
 
				           job-outputs i.e. output of the reduces. It also comes bundled with
			
 
				           <a href="ext:api/org/apache/hadoop/io/compress/compressioncodec">
			
@@ -1765,7 +1765,7 @@
 
				       <title>Example: WordCount v2.0</title>
			
 
				       
			
 
				       <p>Here is a more complete <code>WordCount</code> which uses many of the
			
 
				-      features provided by the Map-Reduce framework we discussed so far.</p>
			
 
				+      features provided by the Map/Reduce framework we discussed so far.</p>
			
 
				       
			
 
				       <p>This needs the HDFS to be up and running, especially for the 
			
 
				       <code>DistributedCache</code>-related features. Hence it only works with a 
			
@@ -2717,7 +2717,7 @@
 
				         <title>Highlights</title>
			
 
				         
			
 
				         <p>The second version of <code>WordCount</code> improves upon the 
			
 
				-        previous one by using some features offered by the Map-Reduce framework:
			
 
				+        previous one by using some features offered by the Map/Reduce framework:
			
 
				         </p>
			
 
				         <ul>
			
 
				           <li>
			
--- a/src/docs/src/documentation/content/xdocs/streaming.xml
+++ b/src/docs/src/documentation/content/xdocs/streaming.xml
@@ -31,7 +31,7 @@
 
				 <title>Hadoop Streaming</title>

			
 
				 

			
 
				 <p>

			
 
				-Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer. For example:

			
 
				+Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. For example:

			
 
				 </p>

			
 
				 <source>

			
 
				 $HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \

			
@@ -45,7 +45,7 @@ $HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
 
				 <section>

			
 
				 <title>How Does Streaming Work </title>

			
 
				 <p>

			
 
				-In the above example, both the mapper and the reducer are executables that read the input from stdin (line by line) and emit the output to stdout. The utility will create a map/reduce job, submit the job to an appropriate cluster, and monitor the progress of the job until it completes.

			
 
				+In the above example, both the mapper and the reducer are executables that read the input from stdin (line by line) and emit the output to stdout. The utility will create a Map/Reduce job, submit the job to an appropriate cluster, and monitor the progress of the job until it completes.

			
 
				 </p><p>

			
 
				   When an executable is specified for mappers, each mapper task will launch the executable as a separate process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feed the lines to the stdin of the process. In the meantime, the mapper collects the line oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the 

			
 
				   <em>prefix of a line up to the first tab character</em> is the <strong>key</strong> and the the rest of the line (excluding the tab character) will be the <strong>value</strong>. 

			
@@ -54,7 +54,7 @@ In the above example, both the mapper and the reducer are executables that read
 
				 <p>

			
 
				 When an executable is specified for reducers, each reducer task will launch the executable as a separate process then the reducer is initialized. As the reducer task runs, it converts its input key/values pairs into lines and feeds the lines to the stdin of the process. In the meantime, the reducer collects the line oriented outputs from the stdout of the process, converts each line into a key/value pair, which is collected as the output of the reducer. By default, the prefix of a line up to the first tab character is the key and the the rest of the line (excluding the tab character) is the value. However, this can be customized, as discussed later.

			
 
				 </p><p>

			
 
				-This is the basis for the communication protocol between the map/reduce framework and the streaming mapper/reducer.

			
 
				+This is the basis for the communication protocol between the Map/Reduce framework and the streaming mapper/reducer.

			
 
				 </p><p>

			
 
				 You can supply a Java class as the mapper and/or the reducer. The above example is equivalent to:

			
 
				 </p>

			
@@ -109,7 +109,7 @@ $HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
 
				 <section>

			
 
				 <title>Mapper-Only Jobs </title>

			
 
				 <p>

			
 
				-Often, you may want to process input data using a map function only. To do this, simply set mapred.reduce.tasks to zero. The map/reduce framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job.

			
 
				+Often, you may want to process input data using a map function only. To do this, simply set mapred.reduce.tasks to zero. The Map/Reduce framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job.

			
 
				 </p><p>

			
 
				 To be backward compatible, Hadoop Streaming also supports the "-reduce NONE" option, which is equivalent to "-jobconf mapred.reduce.tasks=0".

			
 
				 </p>

			
@@ -118,7 +118,7 @@ To be backward compatible, Hadoop Streaming also supports the "-reduce NONE" opt
 
				 <section>

			
 
				 <title>Specifying Other Plugins for Jobs </title>

			
 
				 <p>

			
 
				-Just as with a normal map/reduce job, you can specify other plugins for a streaming job:

			
 
				+Just as with a normal Map/Reduce job, you can specify other plugins for a streaming job:

			
 
				 </p>

			
 
				 <source>

			
 
				    -inputformat JavaClassName

			
@@ -235,7 +235,7 @@ Other options you may specify for a streaming job are described here:
 
				 <tr><th>Parameter</th><th>Optional/Required </th><th>Description </th></tr>

			
 
				 <tr><td> -cluster name </td><td> Optional </td><td> Switch between local Hadoop and one or more remote clusters </td></tr>

			
 
				 

			
 
				-<tr><td> -dfs  host:port or local </td><td> Optional </td><td> Override the DFS configuration for the job </td></tr>

			
 
				+<tr><td> -dfs  host:port or local </td><td> Optional </td><td> Override the HDFS configuration for the job </td></tr>

			
 
				 <tr><td> -jt host:port or local </td><td> Optional </td><td> Override the JobTracker configuration for the job </td></tr>

			
 
				 <tr><td> -additionalconfspec specfile </td><td> Optional </td><td> Specifies a set of configuration variables in an XML file like hadoop-site.xml, instead of using multiple  options of type "-jobconf name=value" </td></tr>

			
 
				 

			
@@ -282,7 +282,7 @@ To set an environment variable in a streaming command use:
 
				 <section>

			
 
				 <title>Customizing the Way to Split Lines into Key/Value Pairs </title>

			
 
				 <p>

			
 
				-As noted earlier, when the map/reduce framework reads a line from the stdout of the mapper, it splits the line into a key/value pair. By default, the prefix of the line up to the first tab character is the key and the the rest of the line (excluding the tab character) is the value.

			
 
				+As noted earlier, when the Map/Reduce framework reads a line from the stdout of the mapper, it splits the line into a key/value pair. By default, the prefix of the line up to the first tab character is the key and the the rest of the line (excluding the tab character) is the value.

			
 
				 </p>

			
 
				 <p>

			
 
				 However, you can customize this default. You can specify a field separator other than the tab character (the default), and you can specify the nth (n >= 1) character rather than the first character in a line (the default) as the separator between the key and value. For example:

			
@@ -308,7 +308,7 @@ Similarly, you can use "-jobconf stream.reduce.output.field.separator=SEP" and "
 
				 <section>

			
 
				 <title>A Useful Partitioner Class (secondary sort, the -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner option) </title>

			
 
				 <p>

			
 
				-Hadoop has a library class, org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner, that is useful for many applications. This class allows the map/reduce framework to partition the map outputs based on prefixes of keys, not the whole keys. For example:

			
 
				+Hadoop has a library class, org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner, that is useful for many applications. This class allows the Map/Reduce framework to partition the map outputs based on prefixes of keys, not the whole keys. For example:

			
 
				 </p>

			
 
				 <source>

			
 
				 $HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \

			
@@ -326,7 +326,7 @@ $HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
 
				 <p>

			
 
				 Here, <em>-jobconf stream.map.output.field.separator=.</em> and <em>-jobconf stream.num.map.output.key.fields=4</em> are as explained in previous example. The two variables are used by streaming to identify the key/value pair of mapper. 

			
 
				 </p><p>

			
 
				-The map output keys of the above map/reduce job normally have four fields separated by ".". However, the map/reduce framework will partition the map outputs by the first two fields of the keys using the <em>-jobconf num.key.fields.for.partition=2</em> option. Here, <em>-jobconf map.output.key.field.separator=.</em> specifies the separator for the partition. This guarantees that all the key/value pairs with the same first two fields in the keys will be partitioned into the same reducer.

			
 
				+The map output keys of the above Map/Reduce job normally have four fields separated by ".". However, the Map/Reduce framework will partition the map outputs by the first two fields of the keys using the <em>-jobconf num.key.fields.for.partition=2</em> option. Here, <em>-jobconf map.output.key.field.separator=.</em> specifies the separator for the partition. This guarantees that all the key/value pairs with the same first two fields in the keys will be partitioned into the same reducer.

			
 
				 </p><p>

			
 
				 <em>This is effectively equivalent to specifying the first two fields as the primary key and the next two fields as the secondary. The primary key is used for partitioning, and the combination of the primary and secondary keys is used for sorting.</em> A simple illustration is shown here:

			
 
				 </p>

			
@@ -456,7 +456,7 @@ Often you do not need the full power of Map Reduce, but only need to run multipl
 
				 As an example, consider the problem of zipping (compressing) a set of files across the hadoop cluster. You can achieve this using either of these methods:

			
 
				 </p><ol>

			
 
				 <li> Hadoop Streaming and custom mapper script:<ul>

			
 
				-  <li> Generate a file containing the full DFS path of the input files. Each map task would get one file name as input.</li>

			
 
				+  <li> Generate a file containing the full HDFS path of the input files. Each map task would get one file name as input.</li>

			
 
				   <li> Create a mapper script which, given a filename, will get the file to local disk, gzip the file and put it back in the desired output directory</li>

			
 
				 </ul></li>

			
 
				 <li>The existing Hadoop Framework:<ul>