
HADOOP-4739. Fix spelling and grammar, improve phrasing of some sections in
mapred tutorial. Contributed by Vivek Ratan.


git-svn-id: https://svn.apache.org/repos/asf/hadoop/core/trunk@721789 13f79535-47bb-0310-9956-ffa450edef68

Christopher Douglas 16 years ago
parent
commit
bc4dc55e58
4 changed files with 199 additions and 192 deletions
  1. CHANGES.txt (+5 -0)
  2. docs/mapred_tutorial.html (+107 -105)
  3. docs/mapred_tutorial.pdf (+2 -2)
  4. src/docs/src/documentation/content/xdocs/mapred_tutorial.xml (+85 -85)

+ 5 - 0
CHANGES.txt

@@ -235,6 +235,11 @@ Trunk (unreleased changes)
 
 
 Release 0.19.1 - Unreleased
 
+  IMPROVEMENTS
+
+    HADOOP-4739. Fix spelling and grammar, improve phrasing of some sections in
+    mapred tutorial. (Vivek Ratan via cdouglas)
+
   BUG FIXES
 
     HADOOP-4697. Fix getBlockLocations in KosmosFileSystem to handle multiple

+ 107 - 105
docs/mapred_tutorial.html

@@ -150,6 +150,9 @@ document.write("Last Published: " + document.lastModified);
 <a href="SLG_user_guide.html">HDFS Utilities</a>
 <a href="SLG_user_guide.html">HDFS Utilities</a>
 </div>
 </div>
 <div class="menuitem">
 <div class="menuitem">
+<a href="libhdfs.html">HDFS C API</a>
+</div>
+<div class="menuitem">
 <a href="hod_user_guide.html">HOD User Guide</a>
 <a href="hod_user_guide.html">HOD User Guide</a>
 </div>
 </div>
 <div class="menuitem">
 <div class="menuitem">
@@ -311,7 +314,7 @@ document.write("Last Published: " + document.lastModified);
 <a href="#Other+Useful+Features">Other Useful Features</a>
 <a href="#Other+Useful+Features">Other Useful Features</a>
 <ul class="minitoc">
 <ul class="minitoc">
 <li>
 <li>
-<a href="#Submitting+Jobs+to+a+Queue">Submitting Jobs to a Queue</a>
+<a href="#Submitting+Jobs+to+Queues">Submitting Jobs to Queues</a>
 </li>
 <li>
 <a href="#Counters">Counters</a>
@@ -348,7 +351,7 @@ document.write("Last Published: " + document.lastModified);
 <a href="#Example%3A+WordCount+v2.0">Example: WordCount v2.0</a>
 <a href="#Example%3A+WordCount+v2.0">Example: WordCount v2.0</a>
 <ul class="minitoc">
 <ul class="minitoc">
 <li>
 <li>
-<a href="#Source+Code-N10F95">Source Code</a>
+<a href="#Source+Code-N10FA4">Source Code</a>
 </li>
 <li>
 <a href="#Sample+Runs">Sample Runs</a>
@@ -2280,23 +2283,26 @@ document.write("Last Published: " + document.lastModified);
           <span class="codefrag">FileSystem</span>.</p>
           <span class="codefrag">FileSystem</span>.</p>
 <a name="N10D0C"></a><a name="Other+Useful+Features"></a>
 <a name="N10D0C"></a><a name="Other+Useful+Features"></a>
 <h3 class="h4">Other Useful Features</h3>
 <h3 class="h4">Other Useful Features</h3>
-<a name="N10D12"></a><a name="Submitting+Jobs+to+a+Queue"></a>
-<h4>Submitting Jobs to a Queue</h4>
-<p>Some job schedulers supported in Hadoop, like the 
-            <a href="capacity_scheduler.html">Capacity
-            Scheduler</a>, support multiple queues. If such a scheduler is
-            being used, users can submit jobs to one of the queues
-            administrators would have defined in the
-            <em>mapred.queue.names</em> property of the Hadoop site
-            configuration. The queue name can be specified through the
-            <em>mapred.job.queue.name</em> property, or through the
-            <a href="api/org/apache/hadoop/mapred/JobConf.html#setQueueName(java.lang.String)">setQueueName(String)</a>
-            API. Note that administrators may choose to define ACLs
-            that control which queues a job can be submitted to by a
-            given user. In that case, if the job is not submitted
-            to one of the queues where the user has access,
-            the job would be rejected.</p>
-<a name="N10D2A"></a><a name="Counters"></a>
+<a name="N10D12"></a><a name="Submitting+Jobs+to+Queues"></a>
+<h4>Submitting Jobs to Queues</h4>
+<p>Users submit jobs to Queues. Queues, as collections of jobs, 
+          allow the system to provide specific functionality. For example, 
+          queues use ACLs to control which users 
+          can submit jobs to them. Queues are expected to be primarily 
+          used by Hadoop Schedulers. </p>
+<p>Hadoop comes configured with a single mandatory queue, called 
+          'default'. Queue names are defined in the 
+          <span class="codefrag">mapred.queue.names</span> property of the Hadoop site
+          configuration. Some job schedulers, such as the 
+          <a href="capacity_scheduler.html">Capacity Scheduler</a>, 
+          support multiple queues.</p>
+<p>A job defines the queue it needs to be submitted to through the
+          <span class="codefrag">mapred.job.queue.name</span> property, or through the
+          <a href="api/org/apache/hadoop/mapred/JobConf.html#setQueueName(java.lang.String)">setQueueName(String)</a>
+          API. Setting the queue name is optional. If a job is submitted 
+          without an associated queue name, it is submitted to the 'default' 
+          queue.</p>
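
As a quick sketch of the API described in the paragraph above (the queue name "myqueue" is illustrative and would have to match one defined in mapred.queue.names; job setup details are omitted):

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class QueueSubmitExample {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(QueueSubmitExample.class);
        conf.setJobName("queue-example");
        // Equivalent to setting mapred.job.queue.name; omit this call to
        // submit to the 'default' queue.
        conf.setQueueName("myqueue");
        // ... mapper, reducer, and input/output paths omitted for brevity ...
        JobClient.runJob(conf);
      }
    }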
+<a name="N10D30"></a><a name="Counters"></a>
 <h4>Counters</h4>
 <p>
 <span class="codefrag">Counters</span> represent global counters, defined either by 
@@ -2313,7 +2319,7 @@ document.write("Last Published: " + document.lastModified);
           in the <span class="codefrag">map</span> and/or 
           <span class="codefrag">reduce</span> methods. These counters are then globally 
           aggregated by the framework.</p>
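
For instance, a mapper might count empty input lines with the Reporter API just mentioned (a minimal sketch; the group and counter names "MyApp"/"EMPTY_LINES" are illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class CountingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {
      public void map(LongWritable key, Text value,
          OutputCollector<Text, LongWritable> output, Reporter reporter)
          throws IOException {
        if (value.toString().length() == 0) {
          // Incremented locally by each task; the framework aggregates globally.
          reporter.incrCounter("MyApp", "EMPTY_LINES", 1);
          return;
        }
        output.collect(value, new LongWritable(1));
      }
    }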
-<a name="N10D59"></a><a name="DistributedCache"></a>
+<a name="N10D5F"></a><a name="DistributedCache"></a>
 <h4>DistributedCache</h4>
 <p>
 <a href="api/org/apache/hadoop/filecache/DistributedCache.html">
@@ -2384,7 +2390,7 @@ document.write("Last Published: " + document.lastModified);
           <span class="codefrag">mapred.job.classpath.{files|archives}</span>. Similarly the
           <span class="codefrag">mapred.job.classpath.{files|archives}</span>. Similarly the
           cached files that are symlinked into the working directory of the
           cached files that are symlinked into the working directory of the
           task can be used to distribute native libraries and load them.</p>
           task can be used to distribute native libraries and load them.</p>
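
A minimal sketch of the distribute-and-symlink step using this API (the HDFS path and the '#lookup.dat' symlink name are illustrative):

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheSetup {
      public static void configure(JobConf conf) throws Exception {
        // The URI fragment after '#' names the symlink created in the
        // task's working directory.
        DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), conf);
        DistributedCache.createSymlink(conf);
      }
    }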
-<a name="N10DDC"></a><a name="Tool"></a>
+<a name="N10DE2"></a><a name="Tool"></a>
 <h4>Tool</h4>
 <p>The <a href="api/org/apache/hadoop/util/Tool.html">Tool</a> 
           interface supports the handling of generic Hadoop command-line options.
@@ -2424,7 +2430,7 @@ document.write("Last Published: " + document.lastModified);
             </span>
           
 </p>
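
A typical skeleton of this pattern (a sketch; the class name MyJob is illustrative):

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyJob extends Configured implements Tool {
      public int run(String[] args) throws Exception {
        // Generic options (-conf, -D, -fs, -jt, -files, ...) have already
        // been folded into the Configuration returned by getConf().
        JobConf job = new JobConf(getConf(), MyJob.class);
        // ... job-specific setup (mapper, reducer, input/output paths) ...
        JobClient.runJob(job);
        return 0;
      }
      public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MyJob(), args));
      }
    }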
-<a name="N10E0E"></a><a name="IsolationRunner"></a>
+<a name="N10E14"></a><a name="IsolationRunner"></a>
 <h4>IsolationRunner</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/IsolationRunner.html">
@@ -2448,7 +2454,7 @@ document.write("Last Published: " + document.lastModified);
 <p>
 <span class="codefrag">IsolationRunner</span> will run the failed task in a single 
           jvm, which can be in the debugger, over precisely the same input.</p>
-<a name="N10E41"></a><a name="Profiling"></a>
+<a name="N10E47"></a><a name="Profiling"></a>
 <h4>Profiling</h4>
 <p>Profiling is a utility to get a representative (2 or 3) sample
           of built-in java profiler for a sample of maps and reduces. </p>
@@ -2481,39 +2487,40 @@ document.write("Last Published: " + document.lastModified);
           <span class="codefrag">-agentlib:hprof=cpu=samples,heap=sites,force=n,thread=y,verbose=n,file=%s</span>
           <span class="codefrag">-agentlib:hprof=cpu=samples,heap=sites,force=n,thread=y,verbose=n,file=%s</span>
           
           
 </p>
 </p>
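
As a sketch, assuming the JobConf profiling setters of this release (the task ranges "0-2" are illustrative):

    import org.apache.hadoop.mapred.JobConf;

    public class ProfilingSetup {
      public static void configure(JobConf conf) {
        conf.setProfileEnabled(true);            // sets mapred.task.profile
        conf.setProfileTaskRange(true, "0-2");   // profile the first three map tasks
        conf.setProfileTaskRange(false, "0-2");  // ...and the first three reduce tasks
      }
    }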
-<a name="N10E75"></a><a name="Debugging"></a>
+<a name="N10E7B"></a><a name="Debugging"></a>
 <h4>Debugging</h4>
-<p>Map/Reduce framework provides a facility to run user-provided 
-          scripts for debugging. When map/reduce task fails, user can run 
-          script for doing post-processing on task logs i.e task's stdout,
-          stderr, syslog and jobconf. The stdout and stderr of the
-          user-provided debug script are printed on the diagnostics. 
-          These outputs are also displayed on job UI on demand. </p>
-<p> In the following sections we discuss how to submit debug script
-          along with the job. For submitting debug script, first it has to
-          distributed. Then the script has to supplied in Configuration. </p>
-<a name="N10E81"></a><a name="How+to+distribute+script+file%3A"></a>
-<h5> How to distribute script file: </h5>
+<p>The Map/Reduce framework provides a facility to run user-provided 
+          scripts for debugging. When a map/reduce task fails, a user can run 
+          a debug script, to process task logs for example. The script is 
+          given access to the task's stdout and stderr outputs, syslog and 
+          jobconf. The output from the debug script's stdout and stderr is 
+          displayed on the console diagnostics and also as part of the 
+          job UI. </p>
+<p> In the following sections we discuss how to submit a debug script
+          with a job. The script file needs to be distributed and submitted to 
+          the framework.</p>
+<a name="N10E87"></a><a name="How+to+distribute+the+script+file%3A"></a>
+<h5> How to distribute the script file: </h5>
 <p>
-          The user has to use 
+          The user needs to use  
           <a href="mapred_tutorial.html#DistributedCache">DistributedCache</a>
           <a href="mapred_tutorial.html#DistributedCache">DistributedCache</a>
-          mechanism to <em>distribute</em> and <em>symlink</em> the
-          debug script file.</p>
-<a name="N10E95"></a><a name="How+to+submit+script%3A"></a>
-<h5> How to submit script: </h5>
-<p> A quick way to submit debug script is to set values for the 
-          properties "mapred.map.task.debug.script" and 
-          "mapred.reduce.task.debug.script" for debugging map task and reduce
-          task respectively. These properties can also be set by using APIs 
+          to <em>distribute</em> and <em>symlink</em> the script file.</p>
+<a name="N10E9B"></a><a name="How+to+submit+the+script%3A"></a>
+<h5> How to submit the script: </h5>
+<p> A quick way to submit the debug script is to set values for the 
+          properties <span class="codefrag">mapred.map.task.debug.script</span> and 
+          <span class="codefrag">mapred.reduce.task.debug.script</span>, for debugging map and 
+          reduce tasks respectively. These properties can also be set by using APIs 
           <a href="api/org/apache/hadoop/mapred/JobConf.html#setMapDebugScript(java.lang.String)">
           <a href="api/org/apache/hadoop/mapred/JobConf.html#setMapDebugScript(java.lang.String)">
           JobConf.setMapDebugScript(String) </a> and
           JobConf.setMapDebugScript(String) </a> and
           <a href="api/org/apache/hadoop/mapred/JobConf.html#setReduceDebugScript(java.lang.String)">
           <a href="api/org/apache/hadoop/mapred/JobConf.html#setReduceDebugScript(java.lang.String)">
-          JobConf.setReduceDebugScript(String) </a>. For streaming, debug 
-          script can be submitted with command-line options -mapdebug,
-          -reducedebug for debugging mapper and reducer respectively.</p>
-<p>The arguments of the script are task's stdout, stderr, 
+          JobConf.setReduceDebugScript(String) </a>. In streaming mode, a debug 
+          script can be submitted with the command-line options 
+          <span class="codefrag">-mapdebug</span> and <span class="codefrag">-reducedebug</span>, for debugging 
+          map and reduce tasks respectively.</p>
+<p>The arguments to the script are the task's stdout, stderr, 
           syslog and jobconf files. The debug command, run on the node where
-          the map/reduce failed, is: <br>
+          the map/reduce task failed, is: <br>
           
 <span class="codefrag"> $script $stdout $stderr $syslog $jobconf </span> 
 </p>
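
Putting the two properties together, a sketch of the JobConf calls (the script names 'debug-map.sh' and 'debug-reduce.sh' are illustrative symlink names, assuming the files were distributed and symlinked via DistributedCache as described above):

    import org.apache.hadoop.mapred.JobConf;

    public class DebugScriptSetup {
      public static void configure(JobConf conf) {
        // Sets mapred.map.task.debug.script and mapred.reduce.task.debug.script.
        conf.setMapDebugScript("./debug-map.sh");
        conf.setReduceDebugScript("./debug-reduce.sh");
      }
    }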
@@ -2523,17 +2530,17 @@ document.write("Last Published: " + document.lastModified);
 <span class="codefrag">$script $stdout $stderr $syslog $jobconf $program </span>  
 <span class="codefrag">$script $stdout $stderr $syslog $jobconf $program </span>  
           
           
 </p>
 </p>
-<a name="N10EB7"></a><a name="Default+Behavior%3A"></a>
+<a name="N10EC9"></a><a name="Default+Behavior%3A"></a>
 <h5> Default Behavior: </h5>
 <p> For pipes, a default script is run to process core dumps under
           gdb, prints stack trace and gives info about running threads. </p>
-<a name="N10EC2"></a><a name="JobControl"></a>
+<a name="N10ED4"></a><a name="JobControl"></a>
 <h4>JobControl</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/jobcontrol/package-summary.html">
           JobControl</a> is a utility which encapsulates a set of Map/Reduce jobs
           and their dependencies.</p>
-<a name="N10ECF"></a><a name="Data+Compression"></a>
+<a name="N10EE1"></a><a name="Data+Compression"></a>
 <h4>Data Compression</h4>
 <p>Hadoop Map/Reduce provides facilities for the application-writer to
           specify compression for both intermediate map-outputs and the
@@ -2547,7 +2554,7 @@ document.write("Last Published: " + document.lastModified);
           codecs for reasons of both performance (zlib) and non-availability of
           Java libraries (lzo). More details on their usage and availability are
           available <a href="native_libraries.html">here</a>.</p>
-<a name="N10EEF"></a><a name="Intermediate+Outputs"></a>
+<a name="N10F01"></a><a name="Intermediate+Outputs"></a>
 <h5>Intermediate Outputs</h5>
 <p>Applications can control compression of intermediate map-outputs
             via the 
@@ -2556,7 +2563,7 @@ document.write("Last Published: " + document.lastModified);
             <span class="codefrag">CompressionCodec</span> to be used via the
             <span class="codefrag">CompressionCodec</span> to be used via the
             <a href="api/org/apache/hadoop/mapred/JobConf.html#setMapOutputCompressorClass(java.lang.Class)">
             <a href="api/org/apache/hadoop/mapred/JobConf.html#setMapOutputCompressorClass(java.lang.Class)">
             JobConf.setMapOutputCompressorClass(Class)</a> api.</p>
             JobConf.setMapOutputCompressorClass(Class)</a> api.</p>
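
For example (a sketch; the choice of GzipCodec is illustrative):

    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.JobConf;

    public class MapOutputCompression {
      public static void configure(JobConf conf) {
        conf.setCompressMapOutput(true);                    // compress intermediate map outputs
        conf.setMapOutputCompressorClass(GzipCodec.class);  // codec to use
      }
    }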
-<a name="N10F04"></a><a name="Job+Outputs"></a>
+<a name="N10F16"></a><a name="Job+Outputs"></a>
 <h5>Job Outputs</h5>
 <p>Applications can control compression of job-outputs via the
             <a href="api/org/apache/hadoop/mapred/FileOutputFormat.html#setCompressOutput(org.apache.hadoop.mapred.JobConf,%20boolean)">
@@ -2573,64 +2580,60 @@ document.write("Last Published: " + document.lastModified);
             <a href="api/org/apache/hadoop/mapred/SequenceFileOutputFormat.html#setOutputCompressionType(org.apache.hadoop.mapred.JobConf,%20org.apache.hadoop.io.SequenceFile.CompressionType)">
             <a href="api/org/apache/hadoop/mapred/SequenceFileOutputFormat.html#setOutputCompressionType(org.apache.hadoop.mapred.JobConf,%20org.apache.hadoop.io.SequenceFile.CompressionType)">
             SequenceFileOutputFormat.setOutputCompressionType(JobConf, 
             SequenceFileOutputFormat.setOutputCompressionType(JobConf, 
             SequenceFile.CompressionType)</a> api.</p>
             SequenceFile.CompressionType)</a> api.</p>
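
A sketch combining the calls above (the codec and compression type are illustrative choices):

    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;

    public class JobOutputCompression {
      public static void configure(JobConf conf) {
        FileOutputFormat.setCompressOutput(conf, true);
        FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
        // For SequenceFile job outputs, also choose RECORD or BLOCK compression.
        SequenceFileOutputFormat.setOutputCompressionType(conf,
            SequenceFile.CompressionType.BLOCK);
      }
    }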
-<a name="N10F31"></a><a name="Skipping+Bad+Records"></a>
+<a name="N10F43"></a><a name="Skipping+Bad+Records"></a>
 <h4>Skipping Bad Records</h4>
-<p>Hadoop provides an optional mode of execution in which the bad 
-          records are detected and skipped in further attempts. 
-          Applications can control various settings via 
+<p>Hadoop provides an option where a certain set of bad input 
+          records can be skipped when processing map inputs. Applications 
+          can control this feature through the  
           <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html">
           <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html">
-          SkipBadRecords</a>.</p>
-<p>This feature can be used when map/reduce tasks crashes 
-          deterministically on certain input. This happens due to bugs in the 
-          map/reduce function. The usual course would be to fix these bugs. 
-          But sometimes this is not possible; perhaps the bug is in third party 
-          libraries for which the source code is not available. Due to this, 
-          the task never reaches to completion even with multiple attempts and 
-          complete data for that task is lost.</p>
-<p>With this feature, only a small portion of data is lost surrounding 
-          the bad record. This may be acceptable for some user applications; 
-          for example applications which are doing statistical analysis on 
-          very large data. By default this feature is disabled. For turning it 
-          on refer <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#setMapperMaxSkipRecords(org.apache.hadoop.conf.Configuration, long)">
+          SkipBadRecords</a> class.</p>
+<p>This feature can be used when map tasks crash deterministically 
+          on certain input. This usually happens due to bugs in the 
+          map function. Usually, the user would have to fix these bugs. 
+          This is, however, not possible sometimes. The bug may be in third 
+          party libraries, for example, for which the source code is not 
+          available. In such cases, the task never completes successfully even
+          after multiple attempts, and the job fails. With this feature, only 
+          a small portion of data surrounding the 
+          bad records is lost, which may be acceptable for some applications 
+          (those performing statistical analysis on very large data, for 
+          example). </p>
+<p>By default this feature is disabled. For enabling it, 
+          refer to <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#setMapperMaxSkipRecords(org.apache.hadoop.conf.Configuration, long)">
           SkipBadRecords.setMapperMaxSkipRecords(Configuration, long)</a> and 
           <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#setReducerMaxSkipGroups(org.apache.hadoop.conf.Configuration, long)">
           SkipBadRecords.setReducerMaxSkipGroups(Configuration, long)</a>.
           </p>
-<p>The skipping mode gets kicked off after certain no of failures
+<p>With this feature enabled, the framework gets into 'skipping 
+          mode' after a certain number of map failures. For more details, 
           see <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#setAttemptsToStartSkipping(org.apache.hadoop.conf.Configuration, int)">
           see <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#setAttemptsToStartSkipping(org.apache.hadoop.conf.Configuration, int)">
-          SkipBadRecords.setAttemptsToStartSkipping(Configuration, int)</a>.
-          </p>
-<p>In the skipping mode, the map/reduce task maintains the record 
-          range which is getting processed at all times. For maintaining this 
-          range, the framework relies on the processed record 
-          counter. see <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#COUNTER_MAP_PROCESSED_RECORDS">
+          SkipBadRecords.setAttemptsToStartSkipping(Configuration, int)</a>. 
+          In 'skipping mode', map tasks maintain the range of records being 
+          processed. To do this, the framework relies on the processed record 
+          counter. See <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#COUNTER_MAP_PROCESSED_RECORDS">
           SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS</a> and 
           <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#COUNTER_REDUCE_PROCESSED_GROUPS">
           SkipBadRecords.COUNTER_REDUCE_PROCESSED_GROUPS</a>. 
-          Based on this counter, the framework knows that how 
-          many records have been processed successfully by mapper/reducer.
-          Before giving the 
-          input to the map/reduce function, it sends this record range to the 
-          Task tracker. If task crashes, the Task tracker knows which one was 
-          the last reported range. On further attempts that range get skipped.
-          </p>
-<p>The number of records skipped for a single bad record depends on 
-          how frequent, the processed counters are incremented by the application. 
-          It is recommended to increment the counter after processing every 
-          single record. However in some applications this might be difficult as 
-          they may be batching up their processing. In that case, the framework 
-          might skip more records surrounding the bad record. If users want to 
-          reduce the number of records skipped, then they can specify the 
-          acceptable value using 
+          This counter enables the framework to know how many records have 
+          been processed successfully, and hence, what record range caused 
+          a task to crash. On further attempts, this range of records is 
+          skipped.</p>
+<p>The number of records skipped depends on how frequently the 
+          processed record counter is incremented by the application. 
+          It is recommended that this counter be incremented after every 
+          record is processed. This may not be possible in some applications 
+          that typically batch their processing. In such cases, the framework 
+          may skip additional records surrounding the bad record. Users can 
+          control the number of skipped records through 
           <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#setMapperMaxSkipRecords(org.apache.hadoop.conf.Configuration, long)">
           <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#setMapperMaxSkipRecords(org.apache.hadoop.conf.Configuration, long)">
           SkipBadRecords.setMapperMaxSkipRecords(Configuration, long)</a> and 
           SkipBadRecords.setMapperMaxSkipRecords(Configuration, long)</a> and 
           <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#setReducerMaxSkipGroups(org.apache.hadoop.conf.Configuration, long)">
           <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#setReducerMaxSkipGroups(org.apache.hadoop.conf.Configuration, long)">
           SkipBadRecords.setReducerMaxSkipGroups(Configuration, long)</a>. 
           SkipBadRecords.setReducerMaxSkipGroups(Configuration, long)</a>. 
-          The framework tries to narrow down the skipped range by employing the 
-          binary search kind of algorithm during task re-executions. The skipped
-          range is divided into two halves and only one half get executed. 
-          Based on the subsequent failure, it figures out which half contains 
-          the bad record. This task re-execution will keep happening till 
+          The framework tries to narrow the range of skipped records using a 
+          binary search-like approach. The skipped range is divided into two 
+          halves and only one half gets executed. On subsequent 
+          failures, the framework figures out which half contains 
+          bad records. A task will be re-executed till the
           acceptable skipped value is met or all task attempts are exhausted.
           To increase the number of task attempts, use
           <a href="api/org/apache/hadoop/mapred/JobConf.html#setMaxMapAttempts(int)">
@@ -2638,16 +2641,15 @@ document.write("Last Published: " + document.lastModified);
           <a href="api/org/apache/hadoop/mapred/JobConf.html#setMaxReduceAttempts(int)">
           <a href="api/org/apache/hadoop/mapred/JobConf.html#setMaxReduceAttempts(int)">
           JobConf.setMaxReduceAttempts(int)</a>.
           JobConf.setMaxReduceAttempts(int)</a>.
           </p>
           </p>
-<p>The skipped records are written to the hdfs in the sequence file 
-          format, which could be used for later analysis. The location of 
-          skipped records output path can be changed by 
+<p>Skipped records are written to HDFS in the sequence file 
+          format, for later analysis. The location can be changed through 
           <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#setSkipOutputPath(org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.Path)">
           <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#setSkipOutputPath(org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.Path)">
           SkipBadRecords.setSkipOutputPath(JobConf, Path)</a>.
           SkipBadRecords.setSkipOutputPath(JobConf, Path)</a>.
           </p>
           </p>
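
Tying the above together, a sketch of enabling the feature (all values and the output path are illustrative):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SkipBadRecords;

    public class SkippingSetup {
      public static void configure(JobConf conf) {
        SkipBadRecords.setMapperMaxSkipRecords(conf, 1);     // acceptable skip window
        SkipBadRecords.setAttemptsToStartSkipping(conf, 2);  // enter skipping mode after 2 failures
        conf.setMaxMapAttempts(8);                           // leave room for the narrowing re-executions
        SkipBadRecords.setSkipOutputPath(conf, new Path("/myapp/skipped"));
      }
    }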
 </div>
 
     
-<a name="N10F7B"></a><a name="Example%3A+WordCount+v2.0"></a>
+<a name="N10F8A"></a><a name="Example%3A+WordCount+v2.0"></a>
 <h2 class="h3">Example: WordCount v2.0</h2>
 <h2 class="h3">Example: WordCount v2.0</h2>
 <div class="section">
 <div class="section">
 <p>Here is a more complete <span class="codefrag">WordCount</span> which uses many of the
 <p>Here is a more complete <span class="codefrag">WordCount</span> which uses many of the
@@ -2657,7 +2659,7 @@ document.write("Last Published: " + document.lastModified);
       <a href="quickstart.html#SingleNodeSetup">pseudo-distributed</a> or
       <a href="quickstart.html#SingleNodeSetup">pseudo-distributed</a> or
       <a href="quickstart.html#Fully-Distributed+Operation">fully-distributed</a> 
       <a href="quickstart.html#Fully-Distributed+Operation">fully-distributed</a> 
       Hadoop installation.</p>
       Hadoop installation.</p>
-<a name="N10F95"></a><a name="Source+Code-N10F95"></a>
+<a name="N10FA4"></a><a name="Source+Code-N10FA4"></a>
 <h3 class="h4">Source Code</h3>
 <h3 class="h4">Source Code</h3>
 <table class="ForrestTable" cellspacing="1" cellpadding="4">
 <table class="ForrestTable" cellspacing="1" cellpadding="4">
           
           
@@ -3867,7 +3869,7 @@ document.write("Last Published: " + document.lastModified);
 </tr>
         
 </table>
-<a name="N116F7"></a><a name="Sample+Runs"></a>
+<a name="N11706"></a><a name="Sample+Runs"></a>
 <h3 class="h4">Sample Runs</h3>
 <h3 class="h4">Sample Runs</h3>
 <p>Sample text-files as input:</p>
 <p>Sample text-files as input:</p>
 <p>
 <p>
@@ -4035,7 +4037,7 @@ document.write("Last Published: " + document.lastModified);
 <br>
         
 </p>
-<a name="N117CB"></a><a name="Highlights"></a>
+<a name="N117DA"></a><a name="Highlights"></a>
 <h3 class="h4">Highlights</h3>
 <h3 class="h4">Highlights</h3>
 <p>The second version of <span class="codefrag">WordCount</span> improves upon the 
 <p>The second version of <span class="codefrag">WordCount</span> improves upon the 
         previous one by using some features offered by the Map/Reduce framework:
         previous one by using some features offered by the Map/Reduce framework:

File diff suppressed because it is too large
+ 2 - 2
docs/mapred_tutorial.pdf


+ 85 - 85
src/docs/src/documentation/content/xdocs/mapred_tutorial.xml

@@ -1679,21 +1679,26 @@
         <title>Other Useful Features</title>
 
         <section>
-          <title>Submitting Jobs to a Queue</title>
-          <p>Some job schedulers supported in Hadoop, like the 
-            <a href="capacity_scheduler.html">Capacity
-            Scheduler</a>, support multiple queues. If such a scheduler is
-            being used, users can submit jobs to one of the queues
-            administrators would have defined in the
-            <em>mapred.queue.names</em> property of the Hadoop site
-            configuration. The queue name can be specified through the
-            <em>mapred.job.queue.name</em> property, or through the
-            <a href="ext:api/org/apache/hadoop/mapred/jobconf/setqueuename">setQueueName(String)</a>
-            API. Note that administrators may choose to define ACLs
-            that control which queues a job can be submitted to by a
-            given user. In that case, if the job is not submitted
-            to one of the queues where the user has access,
-            the job would be rejected.</p>
+          <title>Submitting Jobs to Queues</title>
+          <p>Users submit jobs to Queues. Queues, as collections of jobs, 
+          allow the system to provide specific functionality. For example, 
+          queues use ACLs to control which users 
+          can submit jobs to them. Queues are expected to be primarily 
+          used by Hadoop Schedulers. </p> 
+
+          <p>Hadoop comes configured with a single mandatory queue, called 
+          'default'. Queue names are defined in the 
+          <code>mapred.queue.names</code> property of the Hadoop site
+          configuration. Some job schedulers, such as the 
+          <a href="capacity_scheduler.html">Capacity Scheduler</a>, 
+          support multiple queues.</p>
+          
+          <p>A job defines the queue it needs to be submitted to through the
+          <code>mapred.job.queue.name</code> property, or through the
+          <a href="ext:api/org/apache/hadoop/mapred/jobconf/setqueuename">setQueueName(String)</a>
+          API. Setting the queue name is optional. If a job is submitted 
+          without an associated queue name, it is submitted to the 'default' 
+          queue.</p> 
         </section>
         <section>
           <title>Counters</title>
@@ -1893,40 +1898,41 @@
         
         
         <section>
           <title>Debugging</title>
-          <p>Map/Reduce framework provides a facility to run user-provided 
-          scripts for debugging. When map/reduce task fails, user can run 
-          script for doing post-processing on task logs i.e task's stdout,
-          stderr, syslog and jobconf. The stdout and stderr of the
-          user-provided debug script are printed on the diagnostics. 
-          These outputs are also displayed on job UI on demand. </p>
+          <p>The Map/Reduce framework provides a facility to run user-provided 
+          scripts for debugging. When a map/reduce task fails, a user can run 
+          a debug script, to process task logs for example. The script is 
+          given access to the task's stdout and stderr outputs, syslog and 
+          jobconf. The output from the debug script's stdout and stderr is 
+          displayed on the console diagnostics and also as part of the 
+          job UI. </p>
 
 
-          <p> In the following sections we discuss how to submit debug script
-          along with the job. For submitting debug script, first it has to
-          distributed. Then the script has to supplied in Configuration. </p>
+          <p> In the following sections we discuss how to submit a debug script
+          with a job. The script file needs to be distributed and submitted to 
+          the framework.</p>
           <section>
-          <title> How to distribute script file: </title>
+          <title> How to distribute the script file: </title>
           <p>
-          The user has to use 
+          The user needs to use  
           <a href="mapred_tutorial.html#DistributedCache">DistributedCache</a>
           <a href="mapred_tutorial.html#DistributedCache">DistributedCache</a>
-          mechanism to <em>distribute</em> and <em>symlink</em> the
-          debug script file.</p>
+          to <em>distribute</em> and <em>symlink</em> the script file.</p>
           </section>
           <section>
-          <title> How to submit script: </title>
-          <p> A quick way to submit debug script is to set values for the 
-          properties "mapred.map.task.debug.script" and 
-          "mapred.reduce.task.debug.script" for debugging map task and reduce
-          task respectively. These properties can also be set by using APIs 
+          <title> How to submit the script: </title>
+          <p> A quick way to submit the debug script is to set values for the 
+          properties <code>mapred.map.task.debug.script</code> and 
+          <code>mapred.reduce.task.debug.script</code>, for debugging map and 
+          reduce tasks respectively. These properties can also be set by using APIs 
           <a href="ext:api/org/apache/hadoop/mapred/jobconf/setmapdebugscript">
           <a href="ext:api/org/apache/hadoop/mapred/jobconf/setmapdebugscript">
           JobConf.setMapDebugScript(String) </a> and
           JobConf.setMapDebugScript(String) </a> and
           <a href="ext:api/org/apache/hadoop/mapred/jobconf/setreducedebugscript">
           <a href="ext:api/org/apache/hadoop/mapred/jobconf/setreducedebugscript">
-          JobConf.setReduceDebugScript(String) </a>. For streaming, debug 
-          script can be submitted with command-line options -mapdebug,
-          -reducedebug for debugging mapper and reducer respectively.</p>
+          JobConf.setReduceDebugScript(String) </a>. In streaming mode, a debug 
+          script can be submitted with the command-line options 
+          <code>-mapdebug</code> and <code>-reducedebug</code>, for debugging 
+          map and reduce tasks respectively.</p>
             
             
-          <p>The arguments of the script are task's stdout, stderr, 
+          <p>The arguments to the script are the task's stdout, stderr, 
           syslog and jobconf files. The debug command, run on the node where
-          the map/reduce failed, is: <br/>
+          the map/reduce task failed, is: <br/>
           <code> $script $stdout $stderr $syslog $jobconf </code> </p> 
 
           <p> Pipes programs have the c++ program name as a fifth argument
@@ -2003,67 +2009,62 @@
         
         
         <section>
           <title>Skipping Bad Records</title>
-          <p>Hadoop provides an optional mode of execution in which the bad 
-          records are detected and skipped in further attempts. 
-          Applications can control various settings via 
+          <p>Hadoop provides an option where a certain set of bad input 
+          records can be skipped when processing map inputs. Applications 
+          can control this feature through the  
           <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords">
           <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords">
-          SkipBadRecords</a>.</p>
+          SkipBadRecords</a> class.</p>
           
           
-          <p>This feature can be used when map/reduce tasks crashes 
-          deterministically on certain input. This happens due to bugs in the 
-          map/reduce function. The usual course would be to fix these bugs. 
-          But sometimes this is not possible; perhaps the bug is in third party 
-          libraries for which the source code is not available. Due to this, 
-          the task never reaches to completion even with multiple attempts and 
-          complete data for that task is lost.</p>
+          <p>This feature can be used when map tasks crash deterministically 
+          on certain input. This usually happens due to bugs in the 
+          map function. Usually, the user would have to fix these bugs. 
+          This is, however, not possible sometimes. The bug may be in third 
+          party libraries, for example, for which the source code is not 
+          available. In such cases, the task never completes successfully even
+          after multiple attempts, and the job fails. With this feature, only 
+          a small portion of data surrounding the 
+          bad records is lost, which may be acceptable for some applications 
+          (those performing statistical analysis on very large data, for 
+          example). </p>
 
 
-          <p>With this feature, only a small portion of data is lost surrounding 
-          the bad record. This may be acceptable for some user applications; 
-          for example applications which are doing statistical analysis on 
-          very large data. By default this feature is disabled. For turning it 
-          on refer <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/setmappermaxskiprecords">
+          <p>By default this feature is disabled. For enabling it, 
+          refer to <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/setmappermaxskiprecords">
           SkipBadRecords.setMapperMaxSkipRecords(Configuration, long)</a> and 
           <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/setreducermaxskipgroups">
           SkipBadRecords.setReducerMaxSkipGroups(Configuration, long)</a>.
           </p>
  
  
-          <p>The skipping mode gets kicked off after certain no of failures
+          <p>With this feature enabled, the framework gets into 'skipping 
+          mode' after a certain number of map failures. For more details, 
           see <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/setattemptsTostartskipping">
           see <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/setattemptsTostartskipping">
-          SkipBadRecords.setAttemptsToStartSkipping(Configuration, int)</a>.
-          </p>
- 
-          <p>In the skipping mode, the map/reduce task maintains the record 
-          range which is getting processed at all times. For maintaining this 
-          range, the framework relies on the processed record 
-          counter. see <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/counter_map_processed_records">
+          SkipBadRecords.setAttemptsToStartSkipping(Configuration, int)</a>. 
+          In 'skipping mode', map tasks maintain the range of records being 
+          processed. To do this, the framework relies on the processed record 
+          counter. See <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/counter_map_processed_records">
           SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS</a> and 
           <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/counter_reduce_processed_groups">
           SkipBadRecords.COUNTER_REDUCE_PROCESSED_GROUPS</a>. 
-          Based on this counter, the framework knows that how 
-          many records have been processed successfully by mapper/reducer.
-          Before giving the 
-          input to the map/reduce function, it sends this record range to the 
-          Task tracker. If task crashes, the Task tracker knows which one was 
-          the last reported range. On further attempts that range get skipped.
-          </p>
+          This counter enables the framework to know how many records have 
+          been processed successfully, and hence, what record range caused 
+          a task to crash. On further attempts, this range of records is 
+          skipped.</p>
      
      
-          <p>The number of records skipped for a single bad record depends on 
-          how frequent, the processed counters are incremented by the application. 
-          It is recommended to increment the counter after processing every 
-          single record. However in some applications this might be difficult as 
-          they may be batching up their processing. In that case, the framework 
-          might skip more records surrounding the bad record. If users want to 
-          reduce the number of records skipped, then they can specify the 
-          acceptable value using 
+          <p>The number of records skipped depends on how frequently the 
+          processed record counter is incremented by the application. 
+          It is recommended that this counter be incremented after every 
+          record is processed. This may not be possible in some applications 
+          that typically batch their processing. In such cases, the framework 
+          may skip additional records surrounding the bad record. Users can 
+          control the number of skipped records through 
           <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/setmappermaxskiprecords">
           <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/setmappermaxskiprecords">
           SkipBadRecords.setMapperMaxSkipRecords(Configuration, long)</a> and 
           SkipBadRecords.setMapperMaxSkipRecords(Configuration, long)</a> and 
           <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/setreducermaxskipgroups">
           <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/setreducermaxskipgroups">
           SkipBadRecords.setReducerMaxSkipGroups(Configuration, long)</a>. 
           SkipBadRecords.setReducerMaxSkipGroups(Configuration, long)</a>. 
-          The framework tries to narrow down the skipped range by employing the 
-          binary search kind of algorithm during task re-executions. The skipped
-          range is divided into two halves and only one half get executed. 
-          Based on the subsequent failure, it figures out which half contains 
-          the bad record. This task re-execution will keep happening till 
+          The framework tries to narrow the range of skipped records using a 
+          binary search-like approach. The skipped range is divided into two 
+          halves and only one half gets executed. On subsequent 
+          failures, the framework figures out which half contains 
+          bad records. A task will be re-executed till the
           acceptable skipped value is met or all task attempts are exhausted.
           To increase the number of task attempts, use
           <a href="ext:api/org/apache/hadoop/mapred/jobconf/setmaxmapattempts">
@@ -2072,9 +2073,8 @@
           JobConf.setMaxReduceAttempts(int)</a>.
           </p>
           
           
-          <p>The skipped records are written to the hdfs in the sequence file 
-          format, which could be used for later analysis. The location of 
-          skipped records output path can be changed by 
+          <p>Skipped records are written to HDFS in the sequence file 
+          format, for later analysis. The location can be changed through 
           <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/setskipoutputpath">
           <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/setskipoutputpath">
           SkipBadRecords.setSkipOutputPath(JobConf, Path)</a>.
           SkipBadRecords.setSkipOutputPath(JobConf, Path)</a>.
           </p> 
           </p> 

Some files were not shown because too many files changed in this diff