
HADOOP-3593. Updates the mapred tutorial. Contributed by Devaraj Das.

git-svn-id: https://svn.apache.org/repos/asf/hadoop/core/trunk@669446 13f79535-47bb-0310-9956-ffa450edef68
Author: Devaraj Das
Parent commit: 5c96dceff7

+ 2 - 0
CHANGES.txt

@@ -310,6 +310,8 @@ Release 0.18.0 - Unreleased
     HADOOP-3535. Fix documentation and name of IOUtils.close to
     reflect that it should only be used in cleanup contexts. (omalley)
 
+    HADOOP-3593. Updates the mapred tutorial. (ddas)
+
   OPTIMIZATIONS
 
     HADOOP-3274. The default constructor of BytesWritable creates empty 

+ 58 - 16
docs/changes.html

@@ -76,8 +76,10 @@
     </ol>
   </li>
   <li><a href="javascript:toggleList('trunk_(unreleased_changes)_._bug_fixes_')">  BUG FIXES
-</a>&nbsp;&nbsp;&nbsp;(none)
+</a>&nbsp;&nbsp;&nbsp;(1)
     <ol id="trunk_(unreleased_changes)_._bug_fixes_">
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3563">HADOOP-3563</a>.  Refactor the distributed upgrade code so that it is
+easier to identify datanode and namenode related code.<br />(dhruba)</li>
     </ol>
   </li>
 </ul>
@@ -207,7 +209,7 @@ framework.<br />(tomwhite via omalley)</li>
     </ol>
   </li>
   <li><a href="javascript:toggleList('release_0.18.0_-_unreleased_._improvements_')">  IMPROVEMENTS
-</a>&nbsp;&nbsp;&nbsp;(39)
+</a>&nbsp;&nbsp;&nbsp;(41)
     <ol id="release_0.18.0_-_unreleased_._improvements_">
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-2928">HADOOP-2928</a>. Remove deprecated FileSystem.getContentLength().<br />(Lohit Vjayarenu via rangadi)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3130">HADOOP-3130</a>. Make the connect timeout smaller for getFile.<br />(Amar Ramesh Kamat via ddas)</li>
@@ -290,17 +292,18 @@ the Map-Reduce tutorial.<br />(Amareshwari Sriramadasu via ddas)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3406">HADOOP-3406</a>. Add forrest documentation for Profiling.<br />(Amareshwari Sriramadasu via ddas)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-2762">HADOOP-2762</a>. Add forrest documentation for controls of memory limits on
 hadoop daemons and Map-Reduce tasks.<br />(Amareshwari Sriramadasu via ddas)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3535">HADOOP-3535</a>. Fix documentation and name of IOUtils.close to
+reflect that it should only be used in cleanup contexts.<br />(omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3593">HADOOP-3593</a>. Updates the mapred tutorial.<br />(ddas)</li>
     </ol>
   </li>
   <li><a href="javascript:toggleList('release_0.18.0_-_unreleased_._optimizations_')">  OPTIMIZATIONS
-</a>&nbsp;&nbsp;&nbsp;(10)
+</a>&nbsp;&nbsp;&nbsp;(9)
     <ol id="release_0.18.0_-_unreleased_._optimizations_">
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3274">HADOOP-3274</a>. The default constructor of BytesWritable creates empty
 byte array. (Tsz Wo (Nicholas), SZE via shv)
 </li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3272">HADOOP-3272</a>. Remove redundant copy of Block object in BlocksMap.<br />(Lohit Vjayarenu via shv)</li>
-      <li><a href="http://issues.apache.org/jira/browse/HADOOP-1979">HADOOP-1979</a>. Speed up fsck by adding a buffered stream.<br />(Lohit
-Vijaya Renu via omalley)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3164">HADOOP-3164</a>. Reduce DataNode CPU usage by using FileChannel.tranferTo().
 On Linux DataNode takes 5 times less CPU while serving data. Results may
 vary on other platforms.<br />(rangadi)</li>
@@ -421,11 +424,7 @@ security manager non-fatal.<br />(Edward Yoon via omalley)</li>
 instead of removed getFileCacheHints.<br />(lohit vijayarenu via cdouglas)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3401">HADOOP-3401</a>. Update FileBench to set the new
 "mapred.work.output.dir" property to work post-3041.<br />(cdouglas via omalley)</li>
-      <li><a href="http://issues.apache.org/jira/browse/HADOOP-2159">HADOOP-2159</a> Namenode stuck in safemode. The counter blockSafe should
-not be decremented for invalid blocks.<br />(hairong)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-2669">HADOOP-2669</a>. DFSClient locks pendingCreates appropriately.<br />(dhruba)</li>
-      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3477">HADOOP-3477</a>. Fix build to not package contrib/*/bin twice in
-distributions.<br />(Adam Heath via cutting)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3410">HADOOP-3410</a>. Fix KFS implemenation to return correct file
 modification time.<br />(Sriram Rao via cutting)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3340">HADOOP-3340</a>. Fix DFS metrics for BlocksReplicated, HeartbeatsNum, and
@@ -434,8 +433,6 @@ BlockReportsAverageTime.<br />(lohit vijayarenu via cdouglas)</li>
 /bin/bash and fix the test patch to require bash instead of sh.<br />(Brice Arnould via omalley)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3471">HADOOP-3471</a>. Fix spurious errors from TestIndexedSort and add additional
 logging to let failures be reproducible.<br />(cdouglas)</li>
-      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3475">HADOOP-3475</a>. Fix MapTask to correctly size the accounting allocation of
-io.sort.mb.<br />(cdouglas)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3443">HADOOP-3443</a>. Avoid copying map output across partitions when renaming a
 single spill.<br />(omalley via cdouglas)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3454">HADOOP-3454</a>. Fix Text::find to search only valid byte ranges.<br />(Chad Whipkey
@@ -444,8 +441,6 @@ via cdouglas)</li>
 JobClient. Moves the cli parsing from JobShell to GenericOptionsParser.
 Thus removes the class org.apache.hadoop.mapred.JobShell.<br />(Amareshwari Sriramadasu via ddas)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-2132">HADOOP-2132</a>. Only RUNNING/PREP jobs can be killed.<br />(Jothi Padmanabhan via ddas)</li>
-      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3472">HADOOP-3472</a> MapFile.Reader getClosest() function returns incorrect results
-when before is true<br />(Todd Lipcon via Stack)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3476">HADOOP-3476</a>. Code cleanup in fuse-dfs.<br />(Peter Wyckoff via dhruba)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-2427">HADOOP-2427</a>. Ensure that the cwd of completed tasks is cleaned-up
 correctly on task-completion.<br />(Amareshwari Sri Ramadasu via acmurthy)</li>
@@ -483,9 +478,6 @@ with a configuration.<br />(Subramaniam Krishnan via omalley)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3519">HADOOP-3519</a>.  Fix NPE in DFS FileSystem rename.<br />(hairong via tomwhite)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3528">HADOOP-3528</a>. Metrics FilesCreated and files_deleted metrics
 do not match.<br />(Lohit via Mahadev)</li>
-      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3442">HADOOP-3442</a>. Limit recursion depth on the stack for QuickSort to prevent
-StackOverflowErrors. To avoid O(n*n) cases, when partitioning depth exceeds
-a multiple of log(n), change to HeapSort.<br />(cdouglas)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3418">HADOOP-3418</a>. When a directory is deleted, any leases that point to files
 in the subdirectory are removed. (Tsz Wo (Nicholas), SZE via dhruba)
 </li>
@@ -499,11 +491,61 @@ merge may be missed.<br />(Arun Murthy via ddas)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3560">HADOOP-3560</a>. Fixes a problem to do with split creation in archives.<br />(Mahadev Konar via ddas)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3545">HADOOP-3545</a>. Fixes a overflow problem in archives.<br />(Mahadev Konar via ddas)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3561">HADOOP-3561</a>. Prevent the trash from deleting its parent directories.<br />(cdouglas)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3575">HADOOP-3575</a>. Fix the clover ant target after package refactoring.<br />(Nigel Daley via cdouglas)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3539">HADOOP-3539</a>.  Fix the tool path in the bin/hadoop script under
+cygwin. (Tsz Wo (Nicholas), Sze via omalley)
+</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3520">HADOOP-3520</a>.  TestDFSUpgradeFromImage triggers a race condition in the
+Upgrade Manager. Fixed.<br />(dhruba)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3586">HADOOP-3586</a>. Provide deprecated, backwards compatibile semantics for the
+combiner to be run once and only once on each record.<br />(cdouglas)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3533">HADOOP-3533</a>. Add deprecated methods to provide API compatibility
+between 0.18 and 0.17. Remove the deprecated methods in trunk.<br />(omalley)</li>
     </ol>
   </li>
 </ul>
 <h2><a href="javascript:toggleList('older')">Older Releases</a></h2>
 <ul id="older">
+<h3><a href="javascript:toggleList('release_0.17.1_-_unreleased_')">Release 0.17.1 - Unreleased
+</a></h3>
+<ul id="release_0.17.1_-_unreleased_">
+  <li><a href="javascript:toggleList('release_0.17.1_-_unreleased_._incompatible_changes_')">  INCOMPATIBLE CHANGES
+</a>&nbsp;&nbsp;&nbsp;(1)
+    <ol id="release_0.17.1_-_unreleased_._incompatible_changes_">
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3565">HADOOP-3565</a>. Fix the Java serialization, which is not enabled by
+default, to clear the state of the serializer between objects.<br />(tomwhite via omalley)</li>
+    </ol>
+  </li>
+  <li><a href="javascript:toggleList('release_0.17.1_-_unreleased_._improvements_')">  IMPROVEMENTS
+</a>&nbsp;&nbsp;&nbsp;(1)
+    <ol id="release_0.17.1_-_unreleased_._improvements_">
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3522">HADOOP-3522</a>. Improve documentation on reduce pointing out that
+input keys and values will be reused.<br />(omalley)</li>
+    </ol>
+  </li>
+  <li><a href="javascript:toggleList('release_0.17.1_-_unreleased_._bug_fixes_')">  BUG FIXES
+</a>&nbsp;&nbsp;&nbsp;(8)
+    <ol id="release_0.17.1_-_unreleased_._bug_fixes_">
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-2159">HADOOP-2159</a> Namenode stuck in safemode. The counter blockSafe should
+not be decremented for invalid blocks.<br />(hairong)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3472">HADOOP-3472</a> MapFile.Reader getClosest() function returns incorrect results
+when before is true<br />(Todd Lipcon via Stack)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3442">HADOOP-3442</a>. Limit recursion depth on the stack for QuickSort to prevent
+StackOverflowErrors. To avoid O(n*n) cases, when partitioning depth exceeds
+a multiple of log(n), change to HeapSort.<br />(cdouglas)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3477">HADOOP-3477</a>. Fix build to not package contrib/*/bin twice in
+distributions.<br />(Adam Heath via cutting)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3475">HADOOP-3475</a>. Fix MapTask to correctly size the accounting allocation of
+io.sort.mb.<br />(cdouglas)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3550">HADOOP-3550</a>. Fix the serialization data structures in MapTask where the
+value lengths are incorrectly calculated.<br />(cdouglas)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3526">HADOOP-3526</a>. Fix contrib/data_join framework by cloning values retained
+in the reduce.<br />(Spyros Blanas via cdouglas)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-1979">HADOOP-1979</a>. Speed up fsck by adding a buffered stream.<br />(Lohit
+Vijaya Renu via omalley)</li>
+    </ol>
+  </li>
+</ul>
 <h3><a href="javascript:toggleList('release_0.17.0_-_2008-05-18_')">Release 0.17.0 - 2008-05-18
 </a></h3>
 <ul id="release_0.17.0_-_2008-05-18_">

+ 40 - 37
docs/mapred_tutorial.html

@@ -1296,14 +1296,14 @@ document.write("Last Published: " + document.lastModified);
           the intermediate outputs, which helps to cut down the amount of data 
           transferred from the <span class="codefrag">Mapper</span> to the <span class="codefrag">Reducer</span>.
           </p>
-<p>The intermediate, sorted outputs are always stored in files of 
-          <a href="api/org/apache/hadoop/io/SequenceFile.html">
-          SequenceFile</a> format. Applications can control if, and how, the 
+<p>The intermediate, sorted outputs are always stored in a simple 
+          (key-len, key, value-len, value) format. 
+          Applications can control if, and how, the 
           intermediate outputs are to be compressed and the 
           <a href="api/org/apache/hadoop/io/compress/CompressionCodec.html">
           CompressionCodec</a> to be used via the <span class="codefrag">JobConf</span>.
           </p>
-<a name="N1066B"></a><a name="How+Many+Maps%3F"></a>
+<a name="N10667"></a><a name="How+Many+Maps%3F"></a>
 <h5>How Many Maps?</h5>
 <p>The number of maps is usually driven by the total size of the 
             inputs, that is, the total number of blocks of the input files.</p>
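For reference, compression of the intermediate outputs described in the hunk above stays a JobConf setting even though the storage format changed from SequenceFile to the raw (key-len, key, value-len, value) layout. A minimal sketch against the 0.18-era org.apache.hadoop.mapred API; the codec choice is illustrative:

```java
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;

public class IntermediateCompression {
  public static void configure(JobConf conf) {
    // Compress the sorted map outputs before they are shuffled to reducers.
    conf.setCompressMapOutput(true);
    // Pick the CompressionCodec; GzipCodec is one bundled option.
    conf.setMapOutputCompressorClass(GzipCodec.class);
  }
}
```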
@@ -1316,7 +1316,7 @@ document.write("Last Published: " + document.lastModified);
             <a href="api/org/apache/hadoop/mapred/JobConf.html#setNumMapTasks(int)">
             setNumMapTasks(int)</a> (which only provides a hint to the framework) 
             is used to set it even higher.</p>
-<a name="N10683"></a><a name="Reducer"></a>
+<a name="N1067F"></a><a name="Reducer"></a>
 <h4>Reducer</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/Reducer.html">
@@ -1339,18 +1339,18 @@ document.write("Last Published: " + document.lastModified);
 <p>
 <span class="codefrag">Reducer</span> has 3 primary phases: shuffle, sort and reduce.
           </p>
-<a name="N106B3"></a><a name="Shuffle"></a>
+<a name="N106AF"></a><a name="Shuffle"></a>
 <h5>Shuffle</h5>
 <p>Input to the <span class="codefrag">Reducer</span> is the sorted output of the
             mappers. In this phase the framework fetches the relevant partition 
             of the output of all the mappers, via HTTP.</p>
-<a name="N106C0"></a><a name="Sort"></a>
+<a name="N106BC"></a><a name="Sort"></a>
 <h5>Sort</h5>
 <p>The framework groups <span class="codefrag">Reducer</span> inputs by keys (since 
             different mappers may have output the same key) in this stage.</p>
 <p>The shuffle and sort phases occur simultaneously; while 
             map-outputs are being fetched they are merged.</p>
-<a name="N106CF"></a><a name="Secondary+Sort"></a>
+<a name="N106CB"></a><a name="Secondary+Sort"></a>
 <h5>Secondary Sort</h5>
 <p>If equivalence rules for grouping the intermediate keys are 
               required to be different from those for grouping keys before 
@@ -1361,7 +1361,7 @@ document.write("Last Published: " + document.lastModified);
               JobConf.setOutputKeyComparatorClass(Class)</a> can be used to 
               control how intermediate keys are grouped, these can be used in 
               conjunction to simulate <em>secondary sort on values</em>.</p>
-<a name="N106E8"></a><a name="Reduce"></a>
+<a name="N106E4"></a><a name="Reduce"></a>
 <h5>Reduce</h5>
 <p>In this phase the 
             <a href="api/org/apache/hadoop/mapred/Reducer.html#reduce(K2, java.util.Iterator, org.apache.hadoop.mapred.OutputCollector, org.apache.hadoop.mapred.Reporter)">
@@ -1377,7 +1377,7 @@ document.write("Last Published: " + document.lastModified);
             progress, set application-level status messages and update 
             <span class="codefrag">Counters</span>, or just indicate that they are alive.</p>
 <p>The output of the <span class="codefrag">Reducer</span> is <em>not sorted</em>.</p>
-<a name="N10716"></a><a name="How+Many+Reduces%3F"></a>
+<a name="N10712"></a><a name="How+Many+Reduces%3F"></a>
 <h5>How Many Reduces?</h5>
 <p>The right number of reduces seems to be <span class="codefrag">0.95</span> or 
             <span class="codefrag">1.75</span> multiplied by (&lt;<em>no. of nodes</em>&gt; * 
@@ -1392,17 +1392,17 @@ document.write("Last Published: " + document.lastModified);
 <p>The scaling factors above are slightly less than whole numbers to 
             reserve a few reduce slots in the framework for speculative-tasks and
             failed tasks.</p>
-<a name="N1073B"></a><a name="Reducer+NONE"></a>
+<a name="N10737"></a><a name="Reducer+NONE"></a>
 <h5>Reducer NONE</h5>
 <p>It is legal to set the number of reduce-tasks to <em>zero</em> if 
             no reduction is desired.</p>
 <p>In this case the outputs of the map-tasks go directly to the
             <span class="codefrag">FileSystem</span>, into the output path set by 
-            <a href="api/org/apache/hadoop/mapred/FileInputFormat.html#setOutputPath(org.apache.hadoop.mapred.JobConf,%20org.apache.hadoop.fs.Path)">
+            <a href="api/org/apache/hadoop/mapred/FileOutputFormat.html#setOutputPath(org.apache.hadoop.mapred.JobConf,%20org.apache.hadoop.fs.Path)">
             setOutputPath(Path)</a>. The framework does not sort the 
             map-outputs before writing them out to the <span class="codefrag">FileSystem</span>.
             </p>
-<a name="N10756"></a><a name="Partitioner"></a>
+<a name="N10752"></a><a name="Partitioner"></a>
 <h4>Partitioner</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/Partitioner.html">
@@ -1416,7 +1416,7 @@ document.write("Last Published: " + document.lastModified);
 <p>
 <a href="api/org/apache/hadoop/mapred/lib/HashPartitioner.html">
           HashPartitioner</a> is the default <span class="codefrag">Partitioner</span>.</p>
-<a name="N10775"></a><a name="Reporter"></a>
+<a name="N10771"></a><a name="Reporter"></a>
 <h4>Reporter</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/Reporter.html">
@@ -1435,7 +1435,7 @@ document.write("Last Published: " + document.lastModified);
           </p>
 <p>Applications can also update <span class="codefrag">Counters</span> using the 
           <span class="codefrag">Reporter</span>.</p>
-<a name="N1079F"></a><a name="OutputCollector"></a>
+<a name="N1079B"></a><a name="OutputCollector"></a>
 <h4>OutputCollector</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/OutputCollector.html">
@@ -1446,7 +1446,7 @@ document.write("Last Published: " + document.lastModified);
 <p>Hadoop Map-Reduce comes bundled with a 
         <a href="api/org/apache/hadoop/mapred/lib/package-summary.html">
         library</a> of generally useful mappers, reducers, and partitioners.</p>
-<a name="N107BA"></a><a name="Job+Configuration"></a>
+<a name="N107B6"></a><a name="Job+Configuration"></a>
 <h3 class="h4">Job Configuration</h3>
 <p>
 <a href="api/org/apache/hadoop/mapred/JobConf.html">
@@ -1486,7 +1486,7 @@ document.write("Last Published: " + document.lastModified);
         and (<a href="api/org/apache/hadoop/mapred/FileInputFormat.html#setInputPaths(org.apache.hadoop.mapred.JobConf,%20java.lang.String)">setInputPaths(JobConf, String)</a>
         /<a href="api/org/apache/hadoop/mapred/FileInputFormat.html#addInputPath(org.apache.hadoop.mapred.JobConf,%20java.lang.String)">addInputPaths(JobConf, String)</a>)
         and where the output files should be written
-        (<a href="api/org/apache/hadoop/mapred/FileInputFormat.html#setOutputPath(org.apache.hadoop.mapred.JobConf,%20org.apache.hadoop.fs.Path)">setOutputPath(Path)</a>).</p>
+        (<a href="api/org/apache/hadoop/mapred/FileOutputFormat.html#setOutputPath(org.apache.hadoop.mapred.JobConf,%20org.apache.hadoop.fs.Path)">setOutputPath(Path)</a>).</p>
 <p>Optionally, <span class="codefrag">JobConf</span> is used to specify other advanced 
         facets of the job such as the <span class="codefrag">Comparator</span> to be used, files 
         to be put in the <span class="codefrag">DistributedCache</span>, whether intermediate 
@@ -1504,7 +1504,7 @@ document.write("Last Published: " + document.lastModified);
         <a href="api/org/apache/hadoop/conf/Configuration.html#set(java.lang.String, java.lang.String)">set(String, String)</a>/<a href="api/org/apache/hadoop/conf/Configuration.html#get(java.lang.String, java.lang.String)">get(String, String)</a>
         to set/get arbitrary parameters needed by applications. However, use the 
         <span class="codefrag">DistributedCache</span> for large amounts of (read-only) data.</p>
-<a name="N1084C"></a><a name="Task+Execution+%26+Environment"></a>
+<a name="N10848"></a><a name="Task+Execution+%26+Environment"></a>
 <h3 class="h4">Task Execution &amp; Environment</h3>
 <p>The <span class="codefrag">TaskTracker</span> executes the <span class="codefrag">Mapper</span>/ 
         <span class="codefrag">Reducer</span>  <em>task</em> as a child process in a separate jvm.
@@ -1741,7 +1741,7 @@ document.write("Last Published: " + document.lastModified);
         loaded via <a href="http://java.sun.com/j2se/1.5.0/docs/api/java/lang/System.html#loadLibrary(java.lang.String)">
         System.loadLibrary</a> or <a href="http://java.sun.com/j2se/1.5.0/docs/api/java/lang/System.html#load(java.lang.String)">
         System.load</a>.</p>
-<a name="N109F7"></a><a name="Job+Submission+and+Monitoring"></a>
+<a name="N109F3"></a><a name="Job+Submission+and+Monitoring"></a>
 <h3 class="h4">Job Submission and Monitoring</h3>
 <p>
 <a href="api/org/apache/hadoop/mapred/JobClient.html">
@@ -1802,7 +1802,7 @@ document.write("Last Published: " + document.lastModified);
 <p>Normally the user creates the application, describes various facets 
         of the job via <span class="codefrag">JobConf</span>, and then uses the 
         <span class="codefrag">JobClient</span> to submit the job and monitor its progress.</p>
-<a name="N10A57"></a><a name="Job+Control"></a>
+<a name="N10A53"></a><a name="Job+Control"></a>
 <h4>Job Control</h4>
 <p>Users may need to chain map-reduce jobs to accomplish complex
           tasks which cannot be done via a single map-reduce job. This is fairly
@@ -1838,7 +1838,7 @@ document.write("Last Published: " + document.lastModified);
             </li>
           
 </ul>
-<a name="N10A81"></a><a name="Job+Input"></a>
+<a name="N10A7D"></a><a name="Job+Input"></a>
 <h3 class="h4">Job Input</h3>
 <p>
 <a href="api/org/apache/hadoop/mapred/InputFormat.html">
@@ -1886,7 +1886,7 @@ document.write("Last Published: " + document.lastModified);
         appropriate <span class="codefrag">CompressionCodec</span>. However, it must be noted that
         compressed files with the above extensions cannot be <em>split</em> and 
         each compressed file is processed in its entirety by a single mapper.</p>
-<a name="N10AEB"></a><a name="InputSplit"></a>
+<a name="N10AE7"></a><a name="InputSplit"></a>
 <h4>InputSplit</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/InputSplit.html">
@@ -1900,7 +1900,7 @@ document.write("Last Published: " + document.lastModified);
           FileSplit</a> is the default <span class="codefrag">InputSplit</span>. It sets 
           <span class="codefrag">map.input.file</span> to the path of the input file for the
           logical split.</p>
-<a name="N10B10"></a><a name="RecordReader"></a>
+<a name="N10B0C"></a><a name="RecordReader"></a>
 <h4>RecordReader</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/RecordReader.html">
@@ -1912,7 +1912,7 @@ document.write("Last Published: " + document.lastModified);
           for processing. <span class="codefrag">RecordReader</span> thus assumes the 
           responsibility of processing record boundaries and presents the tasks 
           with keys and values.</p>
-<a name="N10B33"></a><a name="Job+Output"></a>
+<a name="N10B2F"></a><a name="Job+Output"></a>
 <h3 class="h4">Job Output</h3>
 <p>
 <a href="api/org/apache/hadoop/mapred/OutputFormat.html">
@@ -1937,7 +1937,7 @@ document.write("Last Published: " + document.lastModified);
 <p>
 <span class="codefrag">TextOutputFormat</span> is the default 
         <span class="codefrag">OutputFormat</span>.</p>
-<a name="N10B5C"></a><a name="Task+Side-Effect+Files"></a>
+<a name="N10B58"></a><a name="Task+Side-Effect+Files"></a>
 <h4>Task Side-Effect Files</h4>
 <p>In some applications, component tasks need to create and/or write to
           side-files, which differ from the actual job-output files.</p>
@@ -1961,7 +1961,7 @@ document.write("Last Published: " + document.lastModified);
 <p>The application-writer can take advantage of this feature by 
           creating any side-files required in <span class="codefrag">${mapred.work.output.dir}</span>
           during execution of a task via 
-          <a href="api/org/apache/hadoop/mapred/FileInputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)">
+          <a href="api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)">
           FileOutputFormat.getWorkOutputPath()</a>, and the framework will promote them 
          similarly for successful task-attempts, thus eliminating the need to 
           pick unique paths per task-attempt.</p>
@@ -1970,13 +1970,13 @@ document.write("Last Published: " + document.lastModified);
           <span class="codefrag">${mapred.output.dir}/_temporary/_{$taskid}</span>, and this value is 
           set by the map-reduce framework. So, just create any side-files in the 
           path  returned by
-          <a href="api/org/apache/hadoop/mapred/FileInputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)">
+          <a href="api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)">
           FileOutputFormat.getWorkOutputPath() </a>from map/reduce 
           task to take advantage of this feature.</p>
 <p>The entire discussion holds true for maps of jobs with 
            reducer=NONE (i.e. 0 reduces) since output of the map, in that case, 
            goes directly to HDFS.</p>
-<a name="N10BA4"></a><a name="RecordWriter"></a>
+<a name="N10BA0"></a><a name="RecordWriter"></a>
 <h4>RecordWriter</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/RecordWriter.html">
@@ -1984,9 +1984,9 @@ document.write("Last Published: " + document.lastModified);
           pairs to an output file.</p>
 <p>RecordWriter implementations write the job outputs to the 
           <span class="codefrag">FileSystem</span>.</p>
-<a name="N10BBB"></a><a name="Other+Useful+Features"></a>
+<a name="N10BB7"></a><a name="Other+Useful+Features"></a>
 <h3 class="h4">Other Useful Features</h3>
-<a name="N10BC1"></a><a name="Counters"></a>
+<a name="N10BBD"></a><a name="Counters"></a>
 <h4>Counters</h4>
 <p>
 <span class="codefrag">Counters</span> represent global counters, defined either by 
@@ -1997,7 +1997,10 @@ document.write("Last Published: " + document.lastModified);
 <p>Applications can define arbitrary <span class="codefrag">Counters</span> (of type 
           <span class="codefrag">Enum</span>) and update them via 
           <a href="api/org/apache/hadoop/mapred/Reporter.html#incrCounter(java.lang.Enum, long)">
-          Reporter.incrCounter(Enum, long)</a> in the <span class="codefrag">map</span> and/or 
+          Reporter.incrCounter(Enum, long)</a> or 
+          <a href="api/org/apache/hadoop/mapred/Reporter.html#incrCounter(java.lang.String, java.lang.String, long amount)">
+          Reporter.incrCounter(String, String, long)</a>
+          in the <span class="codefrag">map</span> and/or 
           <span class="codefrag">reduce</span> methods. These counters are then globally 
           aggregated by the framework.</p>
 <a name="N10BEC"></a><a name="DistributedCache"></a>
@@ -2010,8 +2013,8 @@ document.write("Last Published: " + document.lastModified);
 <span class="codefrag">DistributedCache</span> is a facility provided by the 
           Map-Reduce framework to cache files (text, archives, jars and so on) 
           needed by applications.</p>
-<p>Applications specify the files to be cached via urls (hdfs:// or 
-          http://) in the <span class="codefrag">JobConf</span>. The <span class="codefrag">DistributedCache</span> 
+<p>Applications specify the files to be cached via urls (hdfs://)
+          in the <span class="codefrag">JobConf</span>. The <span class="codefrag">DistributedCache</span> 
           assumes that the files specified via hdfs:// urls are already present 
           on the <span class="codefrag">FileSystem</span>.</p>
 <p>The framework will copy the necessary files to the slave node 
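A sketch of hdfs://-based caching as the revised paragraph describes (the http:// option was dropped from the text); the URI is illustrative, and per the docs the file must already exist on HDFS:

```java
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class CacheSetup {
  public static void configure(JobConf conf) throws Exception {
    // DistributedCache assumes this hdfs:// file is already on the FileSystem.
    DistributedCache.addCacheFile(new URI("hdfs://namenode:9000/apps/lookup.dat"), conf);
  }
}
```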
@@ -2225,11 +2228,11 @@ document.write("Last Published: " + document.lastModified);
 <a name="N10D57"></a><a name="Job+Outputs"></a>
 <h5>Job Outputs</h5>
 <p>Applications can control compression of job-outputs via the
-            <a href="api/org/apache/hadoop/mapred/OutputFormatBase.html#setCompressOutput(org.apache.hadoop.mapred.JobConf,%20boolean)">
-            OutputFormatBase.setCompressOutput(JobConf, boolean)</a> api and the 
+            <a href="api/org/apache/hadoop/mapred/FileOutputFormat.html#setCompressOutput(org.apache.hadoop.mapred.JobConf,%20boolean)">
+            FileOutputFormat.setCompressOutput(JobConf, boolean)</a> api and the 
             <span class="codefrag">CompressionCodec</span> to be used can be specified via the
-            <a href="api/org/apache/hadoop/mapred/OutputFormatBase.html#setOutputCompressorClass(org.apache.hadoop.mapred.JobConf,%20java.lang.Class)">
-            OutputFormatBase.setOutputCompressorClass(JobConf, Class)</a> api.</p>
+            <a href="api/org/apache/hadoop/mapred/FileOutputFormat.html#setOutputCompressorClass(org.apache.hadoop.mapred.JobConf,%20java.lang.Class)">
+            FileOutputFormat.setOutputCompressorClass(JobConf, Class)</a> api.</p>
 <p>If the job outputs are to be stored in the 
             <a href="api/org/apache/hadoop/mapred/SequenceFileOutputFormat.html">
             SequenceFileOutputFormat</a>, the required
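The renamed compression calls in use; GzipCodec is an example codec, not mandated by the docs:

```java
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class CompressedOutput {
  public static void configure(JobConf conf) {
    FileOutputFormat.setCompressOutput(conf, true);                   // formerly documented on OutputFormatBase
    FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class); // example codec
  }
}
```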

+ 1 - 1
docs/mapred_tutorial.pdf

(File diff suppressed because it is too large)


+ 14 - 11
src/docs/src/documentation/content/xdocs/mapred_tutorial.xml

@@ -771,9 +771,9 @@
           transferred from the <code>Mapper</code> to the <code>Reducer</code>.
           </p>
  
-          <p>The intermediate, sorted outputs are always stored in files of 
-          <a href="ext:api/org/apache/hadoop/io/sequencefile">
-          SequenceFile</a> format. Applications can control if, and how, the 
+          <p>The intermediate, sorted outputs are always stored in a simple 
+          (key-len, key, value-len, value) format. 
+          Applications can control if, and how, the 
           intermediate outputs are to be compressed and the 
           <a href="ext:api/org/apache/hadoop/io/compress/compressioncodec">
           CompressionCodec</a> to be used via the <code>JobConf</code>.
@@ -1469,8 +1469,11 @@
           
           <p>Applications can define arbitrary <code>Counters</code> (of type 
           <code>Enum</code>) and update them via 
-          <a href="ext:api/org/apache/hadoop/mapred/reporter/incrcounter">
-          Reporter.incrCounter(Enum, long)</a> in the <code>map</code> and/or 
+          <a href="ext:api/org/apache/hadoop/mapred/reporter/incrcounterEnum">
+          Reporter.incrCounter(Enum, long)</a> or 
+          <a href="ext:api/org/apache/hadoop/mapred/reporter/incrcounterString">
+          Reporter.incrCounter(String, String, long)</a>
+          in the <code>map</code> and/or 
           <code>reduce</code> methods. These counters are then globally 
           aggregated by the framework.</p>
         </section>       
@@ -1486,8 +1489,8 @@
           Map-Reduce framework to cache files (text, archives, jars and so on) 
           needed by applications.</p>
  
-          <p>Applications specify the files to be cached via urls (hdfs:// or 
-          http://) in the <code>JobConf</code>. The <code>DistributedCache</code> 
+          <p>Applications specify the files to be cached via urls (hdfs://)
+          in the <code>JobConf</code>. The <code>DistributedCache</code> 
           assumes that the files specified via hdfs:// urls are already present 
           on the <code>FileSystem</code>.</p>
 
@@ -1719,11 +1722,11 @@
             <title>Job Outputs</title>
             
             <p>Applications can control compression of job-outputs via the
-            <a href="ext:api/org/apache/hadoop/mapred/outputformatbase/setcompressoutput">
-            OutputFormatBase.setCompressOutput(JobConf, boolean)</a> api and the 
+            <a href="ext:api/org/apache/hadoop/mapred/fileoutputformat/setcompressoutput">
+            FileOutputFormat.setCompressOutput(JobConf, boolean)</a> api and the 
             <code>CompressionCodec</code> to be used can be specified via the
-            <a href="ext:api/org/apache/hadoop/mapred/outputformatbase/setoutputcompressorclass">
-            OutputFormatBase.setOutputCompressorClass(JobConf, Class)</a> api.</p>
+            <a href="ext:api/org/apache/hadoop/mapred/fileoutputformat/setoutputcompressorclass">
+            FileOutputFormat.setOutputCompressorClass(JobConf, Class)</a> api.</p>
             
             <p>If the job outputs are to be stored in the 
             <a href="ext:api/org/apache/hadoop/mapred/sequencefileoutputformat">

+ 5 - 2
src/docs/src/documentation/content/xdocs/site.xml

@@ -136,10 +136,12 @@ See http://forrest.apache.org/docs/linking.html for more info.
                  <setinputpathstring href="#setInputPaths(org.apache.hadoop.mapred.JobConf,%20java.lang.String)" />
                  <addinputpathstring href="#addInputPath(org.apache.hadoop.mapred.JobConf,%20java.lang.String)" />
               </fileinputformat>
-              <fileoutputformat href="FileInputFormat.html">
+              <fileoutputformat href="FileOutputFormat.html">
                 <getoutputpath href="#getOutputPath(org.apache.hadoop.mapred.JobConf)" />
                 <getworkoutputpath href="#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)" />
                 <setoutputpath href="#setOutputPath(org.apache.hadoop.mapred.JobConf,%20org.apache.hadoop.fs.Path)" />
+                <setcompressoutput href="#setCompressOutput(org.apache.hadoop.mapred.JobConf,%20boolean)" />
+                <setoutputcompressorclass href="#setOutputCompressorClass(org.apache.hadoop.mapred.JobConf,%20java.lang.Class)" />
               </fileoutputformat>
               <filesplit href="FileSplit.html" />
               <inputformat href="InputFormat.html" />
@@ -200,7 +202,8 @@ See http://forrest.apache.org/docs/linking.html for more info.
                 <reduce href="#reduce(K2, java.util.Iterator, org.apache.hadoop.mapred.OutputCollector, org.apache.hadoop.mapred.Reporter)" />
               </reducer>
               <reporter href="Reporter.html">
-                <incrcounter href="#incrCounter(java.lang.Enum, long)" />
+                <incrcounterEnum href="#incrCounter(java.lang.Enum, long)" />
+                <incrcounterString href="#incrCounter(java.lang.String, java.lang.String, long amount)" />
               </reporter>
               <runningjob href="RunningJob.html" />
               <textinputformat href="TextInputFormat.html" />

Some files were not shown because too many files changed in this diff