
Merge -r 702163:702164 from trunk onto 0.19 branch. Fixes HADOOP-4301.

git-svn-id: https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19@702166 13f79535-47bb-0310-9956-ffa450edef68
Devaraj Das 16 years ago
parent
commit
339d4c7c30

+ 3 - 0
CHANGES.txt

@@ -431,6 +431,9 @@ Release 0.19.0 - Unreleased
     incrementing the task attempt numbers by 1000 when the job restarts.
     (Amar Kamat via omalley)
 
+    HADOOP-4301. Adds forrest doc for the skip bad records feature.
+    (Sharad Agarwal via ddas)
+
   OPTIMIZATIONS
 
     HADOOP-3556. Removed lock contention in MD5Hash by changing the 

+ 97 - 11
docs/changes.html

@@ -36,7 +36,7 @@
     function collapse() {
       for (var i = 0; i < document.getElementsByTagName("ul").length; i++) {
         var list = document.getElementsByTagName("ul")[i];
-        if (list.id != 'release_0.19.0_-_unreleased_' && list.id != 'release_0.18.1_-_2008-09-17_') {
+        if (list.id != 'release_0.19.0_-_unreleased_' && list.id != 'release_0.18.2_-_unreleased_') {
           list.style.display = "none";
         }
       }
@@ -56,7 +56,7 @@
 </a></h2>
 <ul id="release_0.19.0_-_unreleased_">
   <li><a href="javascript:toggleList('release_0.19.0_-_unreleased_._incompatible_changes_')">  INCOMPATIBLE CHANGES
-</a>&nbsp;&nbsp;&nbsp;(18)
+</a>&nbsp;&nbsp;&nbsp;(20)
     <ol id="release_0.19.0_-_unreleased_._incompatible_changes_">
     <ol id="release_0.19.0_-_unreleased_._incompatible_changes_">
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3595">HADOOP-3595</a>. Remove deprecated methods for mapred.combine.once
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3595">HADOOP-3595</a>. Remove deprecated methods for mapred.combine.once
 functionality, which was necessary to providing backwards
 functionality, which was necessary to providing backwards
@@ -110,10 +110,15 @@ Present Capacity: Space available for dfs,i.e. remaining+used space
 DFS Used%: DFS used space/Present Capacity<br />(Suresh Srinivas via hairong)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3938">HADOOP-3938</a>. Disk space quotas for HDFS. This is similar to namespace
 quotas in 0.18.<br />(rangadi)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4293">HADOOP-4293</a>. Make Configuration Writable and remove unreleased
+WritableJobConf. Configuration.write is renamed to writeXml.<br />(omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4281">HADOOP-4281</a>. Change dfsadmin to report available disk space in a format
+consistent with the web interface as defined in <a href="http://issues.apache.org/jira/browse/HADOOP-2816">HADOOP-2816</a>.<br />(Suresh
+Srinivas via cdouglas)</li>
     </ol>
   </li>
   <li><a href="javascript:toggleList('release_0.19.0_-_unreleased_._new_features_')">  NEW FEATURES
   <li><a href="javascript:toggleList('release_0.19.0_-_unreleased_._new_features_')">  NEW FEATURES
-</a>&nbsp;&nbsp;&nbsp;(39)
+</a>&nbsp;&nbsp;&nbsp;(40)
     <ol id="release_0.19.0_-_unreleased_._new_features_">
     <ol id="release_0.19.0_-_unreleased_._new_features_">
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3341">HADOOP-3341</a>. Allow streaming jobs to specify the field separator for map
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3341">HADOOP-3341</a>. Allow streaming jobs to specify the field separator for map
 and reduce input and output. The new configuration values are:
 and reduce input and output. The new configuration values are:
@@ -195,13 +200,16 @@ Enis Soztutar via acmurthy)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3019">HADOOP-3019</a>. A new library to support total order partitions.<br />(cdouglas via omalley)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3019">HADOOP-3019</a>. A new library to support total order partitions.<br />(cdouglas via omalley)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3924">HADOOP-3924</a>. Added a 'KILLED' job status.<br />(Subramaniam Krishnan via
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3924">HADOOP-3924</a>. Added a 'KILLED' job status.<br />(Subramaniam Krishnan via
 acmurthy)</li>
 acmurthy)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-2421">HADOOP-2421</a>.  Add jdiff output to documentation, listing all API
+changes from the prior release.<br />(cutting)</li>
     </ol>
   </li>
   <li><a href="javascript:toggleList('release_0.19.0_-_unreleased_._improvements_')">  IMPROVEMENTS
-</a>&nbsp;&nbsp;&nbsp;(68)
+</a>&nbsp;&nbsp;&nbsp;(71)
     <ol id="release_0.19.0_-_unreleased_._improvements_">
     <ol id="release_0.19.0_-_unreleased_._improvements_">
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-4205">HADOOP-4205</a>. hive: metastore and ql to use the refactored SerDe library.<br />(zshao)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-4205">HADOOP-4205</a>. hive: metastore and ql to use the refactored SerDe library.<br />(zshao)</li>
-      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4106">HADOOP-4106</a>. libhdfs: add time, permission and user attribute support (part 2).<br />(Pete Wyckoff through zshao)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4106">HADOOP-4106</a>. libhdfs: add time, permission and user attribute support
+(part 2).<br />(Pete Wyckoff through zshao)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-4104">HADOOP-4104</a>. libhdfs: add time, permission and user attribute support.<br />(Pete Wyckoff through zshao)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-4104">HADOOP-4104</a>. libhdfs: add time, permission and user attribute support.<br />(Pete Wyckoff through zshao)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3908">HADOOP-3908</a>. libhdfs: better error message if llibhdfs.so doesn't exist.<br />(Pete Wyckoff through zshao)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3908">HADOOP-3908</a>. libhdfs: better error message if llibhdfs.so doesn't exist.<br />(Pete Wyckoff through zshao)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3732">HADOOP-3732</a>. Delay intialization of datanode block verification till
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3732">HADOOP-3732</a>. Delay intialization of datanode block verification till
@@ -230,8 +238,6 @@ each spill.<br />(Chris Douglas via acmurthy)</li>
 it pluggable.<br />(Tom White and Brice Arnould via omalley)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3756">HADOOP-3756</a>. Minor. Remove unused dfs.client.buffer.dir from
 hadoop-default.xml.<br />(rangadi)</li>
-      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3327">HADOOP-3327</a>. Treats connection and read timeouts differently in the
-shuffle and the backoff logic is dependent on the type of timeout.<br />(Jothi Padmanabhan via ddas)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3747">HADOOP-3747</a>. Adds counter suport for MultipleOutputs.<br />(Alejandro Abdelnur via ddas)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3747">HADOOP-3747</a>. Adds counter suport for MultipleOutputs.<br />(Alejandro Abdelnur via ddas)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3169">HADOOP-3169</a>. LeaseChecker daemon should not be started in DFSClient
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3169">HADOOP-3169</a>. LeaseChecker daemon should not be started in DFSClient
 constructor. (TszWo (Nicholas), SZE via hairong)
 constructor. (TszWo (Nicholas), SZE via hairong)
@@ -321,6 +327,13 @@ userlogs.<br />(Vinod Kumar Vavilapalli via acmurthy)</li>
 connection is closed and also remove an undesirable exception when
 a client is stoped while there is no pending RPC request.<br />(hairong)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-4227">HADOOP-4227</a>. Remove the deprecated class org.apache.hadoop.fs.ShellCommand.<br />(szetszwo)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4006">HADOOP-4006</a>. Clean up FSConstants and move some of the constants to
+better places.<br />(Sanjay Radia via rangadi)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4279">HADOOP-4279</a>. Trace the seeds of random sequences in append unit tests to
+make intermittent failures reproducible.<br />(szetszwo via cdouglas)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4209">HADOOP-4209</a>. Remove the change to the format of task attempt id by
+incrementing the task attempt numbers by 1000 when the job restarts.<br />(Amar Kamat via omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4301">HADOOP-4301</a>. Adds forrest doc for the skip bad records feature.<br />(Sharad Agarwal via ddas)</li>
     </ol>
   </li>
   <li><a href="javascript:toggleList('release_0.19.0_-_unreleased_._optimizations_')">  OPTIMIZATIONS
@@ -347,7 +360,7 @@ it from a different .crc file.<br />(Jothi Padmanabhan via ddas)</li>
     </ol>
   </li>
   <li><a href="javascript:toggleList('release_0.19.0_-_unreleased_._bug_fixes_')">  BUG FIXES
-</a>&nbsp;&nbsp;&nbsp;(88)
+</a>&nbsp;&nbsp;&nbsp;(108)
     <ol id="release_0.19.0_-_unreleased_._bug_fixes_">
     <ol id="release_0.19.0_-_unreleased_._bug_fixes_">
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3563">HADOOP-3563</a>.  Refactor the distributed upgrade code so that it is
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3563">HADOOP-3563</a>.  Refactor the distributed upgrade code so that it is
 easier to identify datanode and namenode related code.<br />(dhruba)</li>
 easier to identify datanode and namenode related code.<br />(dhruba)</li>
@@ -511,11 +524,71 @@ requires write permissions on the file.<br />(dhruba)</li>
 query.<br />(Raghotham Murthy via dhruba)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-4090">HADOOP-4090</a>. The hive scripts pick up hadoop from HADOOP_HOME
 and then the path.<br />(Raghotham Murthy via dhruba)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4242">HADOOP-4242</a>. Remove extra ";" in FSDirectory that blocks compilation
+in some IDE's.<br />(szetszwo via omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4249">HADOOP-4249</a>. Fix eclipse path to include the hsqldb.jar.<br />(szetszwo via
+omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4247">HADOOP-4247</a>. Move InputSampler into org.apache.hadoop.mapred.lib, so that
+examples.jar doesn't depend on tools.jar.<br />(omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4269">HADOOP-4269</a>. Fix the deprecation of LineReader by extending the new class
+into the old name and deprecating it. Also update the tests to test the
+new class.<br />(cdouglas via omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4280">HADOOP-4280</a>. Fix conversions between seconds in C and milliseconds in
+Java for access times for files.<br />(Pete Wyckoff via rangadi)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4254">HADOOP-4254</a>. -setSpaceQuota command does not convert "TB" extenstion to
+terabytes properly. Implementation now uses StringUtils for parsing this.<br />(Raghu Angadi)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4259">HADOOP-4259</a>. Findbugs should run over tools.jar also.<br />(cdouglas via
+omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4275">HADOOP-4275</a>. Move public method isJobValidName from JobID to a private
+method in JobTracker.<br />(omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4173">HADOOP-4173</a>. fix failures in TestProcfsBasedProcessTree and
+TestTaskTrackerMemoryManager tests. ProcfsBasedProcessTree and
+memory management in TaskTracker are disabled on Windows.<br />(Vinod K V via rangadi)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4189">HADOOP-4189</a>. Fixes the history blocksize &amp; intertracker protocol version
+issues introduced as part of <a href="http://issues.apache.org/jira/browse/HADOOP-3245">HADOOP-3245</a>.<br />(Amar Kamat via ddas)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4190">HADOOP-4190</a>. Fixes the backward compatibility issue with Job History.
+introduced by <a href="http://issues.apache.org/jira/browse/HADOOP-3245">HADOOP-3245</a> and <a href="http://issues.apache.org/jira/browse/HADOOP-2403">HADOOP-2403</a>.<br />(Amar Kamat via ddas)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4237">HADOOP-4237</a>. Fixes the TestStreamingBadRecords.testNarrowDown testcase.<br />(Sharad Agarwal via ddas)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4274">HADOOP-4274</a>. Capacity scheduler accidently modifies the underlying
+data structures when browing the job lists.<br />(Hemanth Yamijala via omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4309">HADOOP-4309</a>. Fix eclipse-plugin compilation.<br />(cdouglas)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4232">HADOOP-4232</a>. Fix race condition in JVM reuse when multiple slots become
+free.<br />(ddas via acmurthy)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4302">HADOOP-4302</a>. Fix a race condition in TestReduceFetch that can yield false
+negatives.<br />(cdouglas)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3942">HADOOP-3942</a>. Update distcp documentation to include features introduced in
+<a href="http://issues.apache.org/jira/browse/HADOOP-3873">HADOOP-3873</a>, <a href="http://issues.apache.org/jira/browse/HADOOP-3939">HADOOP-3939</a>. (Tsz Wo (Nicholas), SZE via cdouglas)
+</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4257">HADOOP-4257</a>. The DFS client should pick only one datanode as the candidate
+to initiate lease recovery.  (Tsz Wo (Nicholas), SZE via dhruba)
+</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4319">HADOOP-4319</a>. fuse-dfs dfs_read function returns as many bytes as it is
+told to read unless end-of-file is reached.<br />(Pete Wyckoff via dhruba)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4246">HADOOP-4246</a>. Ensure we have the correct lower bound on the number of
+retries for fetching map-outputs; also fixed the case where the reducer
+automatically kills on too many unique map-outputs could not be fetched
+for small jobs.<br />(Amareshwari Sri Ramadasu via acmurthy)</li>
     </ol>
   </li>
 </ul>
-<h2><a href="javascript:toggleList('release_0.18.1_-_2008-09-17_')">Release 0.18.1 - 2008-09-17
+<h2><a href="javascript:toggleList('release_0.18.2_-_unreleased_')">Release 0.18.2 - Unreleased
 </a></h2>
+<ul id="release_0.18.2_-_unreleased_">
+  <li><a href="javascript:toggleList('release_0.18.2_-_unreleased_._bug_fixes_')">  BUG FIXES
+</a>&nbsp;&nbsp;&nbsp;(3)
+    <ol id="release_0.18.2_-_unreleased_._bug_fixes_">
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4116">HADOOP-4116</a>. Balancer should provide better resource management.<br />(hairong)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3614">HADOOP-3614</a>. Fix a bug that Datanode may use an old GenerationStamp to get
+meta file.<br />(szetszwo)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4314">HADOOP-4314</a>. Simulated datanodes should not include blocks that are still
+being written in their block report.<br />(Raghu Angadi)</li>
+    </ol>
+  </li>
+</ul>
+<h2><a href="javascript:toggleList('older')">Older Releases</a></h2>
+<ul id="older">
+<h3><a href="javascript:toggleList('release_0.18.1_-_2008-09-17_')">Release 0.18.1 - 2008-09-17
+</a></h3>
 <ul id="release_0.18.1_-_2008-09-17_">
 <ul id="release_0.18.1_-_2008-09-17_">
   <li><a href="javascript:toggleList('release_0.18.1_-_2008-09-17_._improvements_')">  IMPROVEMENTS
   <li><a href="javascript:toggleList('release_0.18.1_-_2008-09-17_._improvements_')">  IMPROVEMENTS
 </a>&nbsp;&nbsp;&nbsp;(1)
 </a>&nbsp;&nbsp;&nbsp;(1)
@@ -540,8 +613,6 @@ outputs or when the final map outputs are being fetched without contention.<br /
     </ol>
   </li>
 </ul>
-<h2><a href="javascript:toggleList('older')">Older Releases</a></h2>
-<ul id="older">
 <h3><a href="javascript:toggleList('release_0.18.0_-_2008-08-19_')">Release 0.18.0 - 2008-08-19
 <h3><a href="javascript:toggleList('release_0.18.0_-_2008-08-19_')">Release 0.18.0 - 2008-08-19
 </a></h3>
 </a></h3>
 <ul id="release_0.18.0_-_2008-08-19_">
 <ul id="release_0.18.0_-_2008-08-19_">
@@ -1085,6 +1156,21 @@ cdouglas)</li>
     </ol>
   </li>
 </ul>
+<h3><a href="javascript:toggleList('release_0.17.3_-_unreleased_')">Release 0.17.3 - Unreleased
+</a></h3>
+<ul id="release_0.17.3_-_unreleased_">
+  <li><a href="javascript:toggleList('release_0.17.3_-_unreleased_._bug_fixes_')">  BUG FIXES
+</a>&nbsp;&nbsp;&nbsp;(4)
+    <ol id="release_0.17.3_-_unreleased_._bug_fixes_">
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4277">HADOOP-4277</a>. Checksum verification was mistakenly disabled for
+LocalFileSystem.<br />(Raghu Angadi)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4271">HADOOP-4271</a>. Checksum input stream can sometimes return invalid
+data to the user.<br />(Ning Li via rangadi)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4318">HADOOP-4318</a>. DistCp should use absolute paths for cleanup.<br />(szetszwo)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4326">HADOOP-4326</a>. ChecksumFileSystem does not override create(...) correctly.<br />(szetszwo)</li>
+    </ol>
+  </li>
+</ul>
 <h3><a href="javascript:toggleList('release_0.17.2_-_2008-08-11_')">Release 0.17.2 - 2008-08-11
 <h3><a href="javascript:toggleList('release_0.17.2_-_2008-08-11_')">Release 0.17.2 - 2008-08-11
 </a></h3>
 </a></h3>
 <ul id="release_0.17.2_-_2008-08-11_">
 <ul id="release_0.17.2_-_2008-08-11_">

+ 11 - 11
docs/hadoop-default.html

@@ -442,12 +442,15 @@ creations/deletions), or "all".</td>
 </tr>
 <tr>
 <td><a name="mapred.tasktracker.taskmemorymanager.monitoring-interval">mapred.tasktracker.taskmemorymanager.monitoring-interval</a></td><td>5000</td><td>The interval, in milliseconds, for which the tasktracker waits
-   between two cycles of monitoring its tasks' memory usage.</td>
+   between two cycles of monitoring its tasks' memory usage. Used only if
+   tasks' memory management is enabled via mapred.tasktracker.tasks.maxmemory.
+   </td>
 </tr>
 <tr>
 <td><a name="mapred.tasktracker.procfsbasedprocesstree.sleeptime-before-sigkill">mapred.tasktracker.procfsbasedprocesstree.sleeptime-before-sigkill</a></td><td>5000</td><td>The time, in milliseconds, the tasktracker waits for sending a
   SIGKILL to a process that has overrun memory limits, after it has been sent
-  a SIGTERM.</td>
+  a SIGTERM. Used only if tasks' memory management is enabled via
+  mapred.tasktracker.tasks.maxmemory.</td>
 </tr>
 <tr>
 <td><a name="mapred.map.tasks">mapred.map.tasks</a></td><td>2</td><td>The default number of map tasks per job.  Typically set
@@ -467,15 +470,10 @@ creations/deletions), or "all".</td>
   </td>
 </tr>
 <tr>
-<td><a name="mapred.jobtracker.job.history.block.size">mapred.jobtracker.job.history.block.size</a></td><td>0</td><td>The block size of the job history file. Since the job recovery
+<td><a name="mapred.jobtracker.job.history.block.size">mapred.jobtracker.job.history.block.size</a></td><td>3145728&gt;</td><td>The block size of the job history file. Since the job recovery
                uses job history, its important to dump job history to disk as 
-               soon as possible.
-  </td>
-</tr>
-<tr>
-<td><a name="mapred.jobtracker.job.history.buffer.size">mapred.jobtracker.job.history.buffer.size</a></td><td>4096</td><td>The buffer size for the job history file. Since the job 
-               recovery uses job history, its important to frequently flush the 
-               job history to disk. This will minimize the loss in recovery.
+               soon as possible. Note that this is an expert level parameter.
+               The default value is set to 3 MB.
   </td>
 </tr>
 <tr>
@@ -914,7 +912,9 @@ creations/deletions), or "all".</td>
   	tasks. Any task scheduled on this tasktracker is guaranteed and constrained
   	 to use a share of this amount. Any task exceeding its share will be 
   	killed. If set to -1, this functionality is disabled, and 
-  	mapred.task.maxmemory is ignored.
+  	mapred.task.maxmemory is ignored. Further, it will be enabled only on the
+  	systems where org.apache.hadoop.util.ProcfsBasedProcessTree is available,
+  	i.e. at present only on Linux.
   </td>
 </tr>
 <tr>
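
For reference, here is a minimal sketch of how these memory-management properties might be set programmatically. The property names and defaults are taken from the table above; in practice these are tasktracker-side settings that normally belong in hadoop-site.xml, so the Configuration calls below only illustrate the names, types and units (the byte unit for mapred.tasktracker.tasks.maxmemory is an assumption, as the table does not restate it).

    import org.apache.hadoop.conf.Configuration;

    public class MemoryLimitsExample {
      public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Total memory available to all tasks on a tasktracker (assumed
        // bytes). -1, the default, disables task memory management; per the
        // description above it only takes effect where
        // org.apache.hadoop.util.ProcfsBasedProcessTree works, i.e. on Linux.
        conf.setLong("mapred.tasktracker.tasks.maxmemory",
            2L * 1024 * 1024 * 1024);

        // How often (ms) the tasktracker polls its tasks' memory usage;
        // ignored unless the limit above is enabled.
        conf.setLong(
            "mapred.tasktracker.taskmemorymanager.monitoring-interval", 5000L);

        // Grace period (ms) between the SIGTERM and SIGKILL sent to a task
        // that has overrun its memory limit.
        conf.setLong(
            "mapred.tasktracker.procfsbasedprocesstree.sleeptime-before-sigkill",
            5000L);
      }
    }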

+ 79 - 5
docs/mapred_tutorial.html

@@ -319,6 +319,9 @@ document.write("Last Published: " + document.lastModified);
 <li>
 <a href="#Data+Compression">Data Compression</a>
 </li>
+<li>
+<a href="#Skipping+Bad+Records">Skipping Bad Records</a>
+</li>
 </ul>
 </li>
 </ul>
@@ -327,7 +330,7 @@ document.write("Last Published: " + document.lastModified);
 <a href="#Example%3A+WordCount+v2.0">Example: WordCount v2.0</a>
 <a href="#Example%3A+WordCount+v2.0">Example: WordCount v2.0</a>
 <ul class="minitoc">
 <ul class="minitoc">
 <li>
 <li>
-<a href="#Source+Code-N10F30">Source Code</a>
+<a href="#Source+Code-N10F78">Source Code</a>
 </li>
 <li>
 <a href="#Sample+Runs">Sample Runs</a>
@@ -2542,10 +2545,81 @@ document.write("Last Published: " + document.lastModified);
             <a href="api/org/apache/hadoop/mapred/SequenceFileOutputFormat.html#setOutputCompressionType(org.apache.hadoop.mapred.JobConf,%20org.apache.hadoop.io.SequenceFile.CompressionType)">
             <a href="api/org/apache/hadoop/mapred/SequenceFileOutputFormat.html#setOutputCompressionType(org.apache.hadoop.mapred.JobConf,%20org.apache.hadoop.io.SequenceFile.CompressionType)">
             SequenceFileOutputFormat.setOutputCompressionType(JobConf, 
             SequenceFileOutputFormat.setOutputCompressionType(JobConf, 
             SequenceFile.CompressionType)</a> api.</p>
             SequenceFile.CompressionType)</a> api.</p>
+<a name="N10F14"></a><a name="Skipping+Bad+Records"></a>
+<h4>Skipping Bad Records</h4>
+<p>Hadoop provides an optional mode of execution in which bad 
+          records are detected and skipped in further attempts. 
+          Applications can control the feature's various settings via 
+          <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html">
+          SkipBadRecords</a>.</p>
+<p>This feature can be used when map/reduce tasks crash 
+          deterministically on certain input, due to bugs in the 
+          map/reduce function. The usual course would be to fix these 
+          bugs, but sometimes that is not possible; the bug may be in 
+          third-party libraries whose source code is not available. In 
+          such cases the task never completes, even with multiple 
+          attempts, and the complete data for that task is lost.</p>
+<p>With this feature, only a small portion of data surrounding the 
+          bad record is lost. This may be acceptable for some applications, 
+          for example those doing statistical analysis on very large data 
+          sets. The feature is disabled by default; to turn it on, refer to 
+          <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#setMapperMaxSkipRecords(org.apache.hadoop.conf.Configuration, long)">
+          SkipBadRecords.setMapperMaxSkipRecords(Configuration, long)</a> and 
+          <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#setReducerMaxSkipGroups(org.apache.hadoop.conf.Configuration, long)">
+          SkipBadRecords.setReducerMaxSkipGroups(Configuration, long)</a>.
+          </p>
+<p>Skipping mode kicks in after a certain number of task failures; 
+          see <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#setAttemptsToStartSkipping(org.apache.hadoop.conf.Configuration, int)">
+          SkipBadRecords.setAttemptsToStartSkipping(Configuration, int)</a>.
+          </p>
+<p>In skipping mode, the map/reduce task maintains the record 
+          range currently being processed. To maintain this range, the 
+          framework relies on the processed record 
+          counter; see <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#COUNTER_MAP_PROCESSED_RECORDS">
+          SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS</a> and 
+          <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#COUNTER_REDUCE_PROCESSED_GROUPS">
+          SkipBadRecords.COUNTER_REDUCE_PROCESSED_GROUPS</a>. 
+          From this counter, the framework knows how many records have 
+          been processed successfully by the mapper/reducer. Before 
+          handing input to the map/reduce function, it reports this 
+          record range to the tasktracker. If the task crashes, the 
+          tasktracker knows which range was last reported, and that 
+          range is skipped on further attempts.
+          </p>
+<p>The number of records skipped around a single bad record depends 
+          on how frequently the application increments the processed 
+          counter. It is recommended to increment the counter after 
+          processing every single record. However, in some applications 
+          this may be difficult, as they batch up their processing; in 
+          that case, the framework may skip additional records 
+          surrounding the bad record. Users can reduce the number of 
+          records skipped by specifying an acceptable value using 
+          <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#setMapperMaxSkipRecords(org.apache.hadoop.conf.Configuration, long)">
+          SkipBadRecords.setMapperMaxSkipRecords(Configuration, long)</a> and 
+          <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#setReducerMaxSkipGroups(org.apache.hadoop.conf.Configuration, long)">
+          SkipBadRecords.setReducerMaxSkipGroups(Configuration, long)</a>. 
+          The framework tries to narrow the skipped range with a 
+          binary-search-like algorithm during task re-executions: the 
+          skipped range is divided into two halves and only one half is 
+          executed, and from the subsequent failure the framework 
+          determines which half contains the bad record. Re-execution 
+          continues until the acceptable skipped value is met or all 
+          task attempts are exhausted. 
+          To increase the number of task attempts, use 
+          <a href="api/org/apache/hadoop/mapred/JobConf.html#setMaxMapAttempts(int)">
+          JobConf.setMaxMapAttempts(int)</a> and 
+          <a href="api/org/apache/hadoop/mapred/JobConf.html#setMaxReduceAttempts(int)">
+          JobConf.setMaxReduceAttempts(int)</a>.
+          </p>
+<p>The skipped records are written to HDFS in sequence file 
+          format, which can be used for later analysis. The location of 
+          the skipped-records output path can be changed via 
+          <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#setSkipOutputPath(org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.Path)">
+          SkipBadRecords.setSkipOutputPath(JobConf, Path)</a>.
+          </p>
 </div>
 
     
-<a name="N10F16"></a><a name="Example%3A+WordCount+v2.0"></a>
+<a name="N10F5E"></a><a name="Example%3A+WordCount+v2.0"></a>
 <h2 class="h3">Example: WordCount v2.0</h2>
 <h2 class="h3">Example: WordCount v2.0</h2>
 <div class="section">
 <div class="section">
 <p>Here is a more complete <span class="codefrag">WordCount</span> which uses many of the
 <p>Here is a more complete <span class="codefrag">WordCount</span> which uses many of the
@@ -2555,7 +2629,7 @@ document.write("Last Published: " + document.lastModified);
       <a href="quickstart.html#SingleNodeSetup">pseudo-distributed</a> or
       <a href="quickstart.html#SingleNodeSetup">pseudo-distributed</a> or
       <a href="quickstart.html#Fully-Distributed+Operation">fully-distributed</a> 
       <a href="quickstart.html#Fully-Distributed+Operation">fully-distributed</a> 
       Hadoop installation.</p>
       Hadoop installation.</p>
-<a name="N10F30"></a><a name="Source+Code-N10F30"></a>
+<a name="N10F78"></a><a name="Source+Code-N10F78"></a>
 <h3 class="h4">Source Code</h3>
 <h3 class="h4">Source Code</h3>
 <table class="ForrestTable" cellspacing="1" cellpadding="4">
 <table class="ForrestTable" cellspacing="1" cellpadding="4">
           
           
@@ -3765,7 +3839,7 @@ document.write("Last Published: " + document.lastModified);
 </tr>
         
 </table>
-<a name="N11692"></a><a name="Sample+Runs"></a>
+<a name="N116DA"></a><a name="Sample+Runs"></a>
 <h3 class="h4">Sample Runs</h3>
 <h3 class="h4">Sample Runs</h3>
 <p>Sample text-files as input:</p>
 <p>Sample text-files as input:</p>
 <p>
 <p>
@@ -3933,7 +4007,7 @@ document.write("Last Published: " + document.lastModified);
 <br>
         
 </p>
-<a name="N11766"></a><a name="Highlights"></a>
+<a name="N117AE"></a><a name="Highlights"></a>
 <h3 class="h4">Highlights</h3>
 <h3 class="h4">Highlights</h3>
 <p>The second version of <span class="codefrag">WordCount</span> improves upon the 
 <p>The second version of <span class="codefrag">WordCount</span> improves upon the 
         previous one by using some features offered by the Map/Reduce framework:
         previous one by using some features offered by the Map/Reduce framework:
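
Since the new tutorial section above is API-heavy, here is a short, illustrative sketch of what enabling the feature looks like from job-submission code, using only the SkipBadRecords and JobConf methods the section names. The class name, job name, attempt counts and output path are invented for the example.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SkipBadRecords;

    public class SkippingJobSetup {
      public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(SkippingJobSetup.class);
        job.setJobName("wordcount-with-skipping");

        // Enable skipping: tolerate losing at most one record (map) or one
        // group (reduce) around each bad record.
        SkipBadRecords.setMapperMaxSkipRecords(job, 1);
        SkipBadRecords.setReducerMaxSkipGroups(job, 1);

        // Run the first two attempts normally; start skipping on the third.
        SkipBadRecords.setAttemptsToStartSkipping(job, 2);

        // Allow extra attempts so the binary-search narrowing has room to
        // converge on the bad record.
        job.setMaxMapAttempts(8);
        job.setMaxReduceAttempts(8);

        // Keep the skipped records on HDFS for later analysis.
        SkipBadRecords.setSkipOutputPath(job, new Path("/user/hadoop/skipped"));

        // ... set mapper/reducer and input/output paths, then submit with
        // JobClient.runJob(job).
      }
    }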

+ 3 - 3
docs/mapred_tutorial.pdf

File diff suppressed because it is too large


+ 80 - 0
src/docs/src/documentation/content/xdocs/mapred_tutorial.xml

@@ -1990,6 +1990,86 @@
           </section>
         </section>
         
+        <section>
+          <title>Skipping Bad Records</title>
+          <p>Hadoop provides an optional mode of execution in which bad 
+          records are detected and skipped in further attempts. 
+          Applications can control the feature's various settings via 
+          <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords">
+          SkipBadRecords</a>.</p>
+          
+          <p>This feature can be used when map/reduce tasks crash 
+          deterministically on certain input, due to bugs in the 
+          map/reduce function. The usual course would be to fix these 
+          bugs, but sometimes that is not possible; the bug may be in 
+          third-party libraries whose source code is not available. In 
+          such cases the task never completes, even with multiple 
+          attempts, and the complete data for that task is lost.</p>
+
+          <p>With this feature, only a small portion of data surrounding 
+          the bad record is lost. This may be acceptable for some 
+          applications, for example those doing statistical analysis on 
+          very large data sets. The feature is disabled by default; to 
+          turn it on, refer to <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/setmappermaxskiprecords">
+          SkipBadRecords.setMapperMaxSkipRecords(Configuration, long)</a> and 
+          <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/setreducermaxskipgroups">
+          SkipBadRecords.setReducerMaxSkipGroups(Configuration, long)</a>.
+          </p>
+ 
+          <p>Skipping mode kicks in after a certain number of task failures; 
+          see <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/setattemptsTostartskipping">
+          SkipBadRecords.setAttemptsToStartSkipping(Configuration, int)</a>.
+          </p>
+ 
+          <p>In skipping mode, the map/reduce task maintains the record 
+          range currently being processed. To maintain this range, the 
+          framework relies on the processed record 
+          counter; see <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/counter_map_processed_records">
+          SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS</a> and 
+          <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/counter_reduce_processed_groups">
+          SkipBadRecords.COUNTER_REDUCE_PROCESSED_GROUPS</a>. 
+          From this counter, the framework knows how many records have 
+          been processed successfully by the mapper/reducer. Before 
+          handing input to the map/reduce function, it reports this 
+          record range to the tasktracker. If the task crashes, the 
+          tasktracker knows which range was last reported, and that 
+          range is skipped on further attempts.
+          </p>
+     
+          <p>The number of records skipped around a single bad record 
+          depends on how frequently the application increments the 
+          processed counter. It is recommended to increment the counter 
+          after processing every single record. However, in some 
+          applications this may be difficult, as they batch up their 
+          processing; in that case, the framework may skip additional 
+          records surrounding the bad record. Users can reduce the 
+          number of records skipped by specifying an acceptable value using 
+          <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/setmappermaxskiprecords">
+          SkipBadRecords.setMapperMaxSkipRecords(Configuration, long)</a> and 
+          <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/setreducermaxskipgroups">
+          SkipBadRecords.setReducerMaxSkipGroups(Configuration, long)</a>. 
+          The framework tries to narrow the skipped range with a 
+          binary-search-like algorithm during task re-executions: the 
+          skipped range is divided into two halves and only one half is 
+          executed, and from the subsequent failure the framework 
+          determines which half contains the bad record. Re-execution 
+          continues until the acceptable skipped value is met or all 
+          task attempts are exhausted. 
+          To increase the number of task attempts, use 
+          <a href="ext:api/org/apache/hadoop/mapred/jobconf/setmaxmapattempts">
+          JobConf.setMaxMapAttempts(int)</a> and 
+          <a href="ext:api/org/apache/hadoop/mapred/jobconf/setmaxreduceattempts">
+          JobConf.setMaxReduceAttempts(int)</a>.
+          </p>
+          
+          <p>The skipped records are written to HDFS in sequence file 
+          format, which can be used for later analysis. The location of 
+          the skipped-records output path can be changed via 
+          <a href="ext:api/org/apache/hadoop/mapred/skipbadrecords/setskipoutputpath">
+          SkipBadRecords.setSkipOutputPath(JobConf, Path)</a>.
+          </p> 
+
+        </section>
+        
       </section>
     </section>
 
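
The tutorial text above recommends incrementing the processed-record counter after every record so the framework can pin failures down to a single bad record. A hypothetical mapper along those lines is sketched below; the counter group constant SkipBadRecords.COUNTER_GROUP is assumed from the API, and the word-counting logic is purely illustrative.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.SkipBadRecords;

    public class SkippingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private static final IntWritable ONE = new IntWritable(1);

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        // Application logic that may crash deterministically on a bad record.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          output.collect(new Text(itr.nextToken()), ONE);
        }
        // Report that this record completed, so a crash on the next record
        // can be narrowed to (at best) a single bad record.
        reporter.incrCounter(SkipBadRecords.COUNTER_GROUP,
            SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS, 1);
      }
    }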

+ 8 - 0
src/docs/src/documentation/content/xdocs/site.xml

@@ -212,6 +212,14 @@ See http://forrest.apache.org/docs/linking.html for more info.
                 <incrcounterString href="#incrCounter(java.lang.String, java.lang.String, long amount)" />
               </reporter>
               <runningjob href="RunningJob.html" />
+              <skipbadrecords href="SkipBadRecords.html">
+                <setmappermaxskiprecords href="#setMapperMaxSkipRecords(org.apache.hadoop.conf.Configuration, long)"/>
+                <setreducermaxskipgroups href="#setReducerMaxSkipGroups(org.apache.hadoop.conf.Configuration, long)"/>
+                <setattemptsTostartskipping href="#setAttemptsToStartSkipping(org.apache.hadoop.conf.Configuration, int)"/>
+                <setskipoutputpath href="#setSkipOutputPath(org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.Path)"/>
+                <counter_map_processed_records href="#COUNTER_MAP_PROCESSED_RECORDS"/>
+                <counter_reduce_processed_groups href="#COUNTER_REDUCE_PROCESSED_GROUPS"/>
+              </skipbadrecords>
               <textinputformat href="TextInputFormat.html" />
               <textoutputformat href="TextOutputFormat.html" />
               <lib href="lib/">

Some files were not shown because too many files changed in this diff