
Preparing to branch for release 0.16.0

git-svn-id: https://svn.apache.org/repos/asf/hadoop/core/trunk@616662 13f79535-47bb-0310-9956-ffa450edef68
Nigel Daley 17 years ago
parent
commit
7496648a65

+ 10 - 0
CHANGES.txt

@@ -3,6 +3,16 @@ Hadoop Change Log
 
 Trunk (unreleased changes)
 
+  INCOMPATIBLE CHANGES
+
+  NEW FEATURES
+
+  OPTIMIZATIONS
+
+  BUG FIXES
+
+Release 0.16.0 - 2008-02-04
+
   INCOMPATIBLE CHANGES
 
     HADOOP-1245.  Use the mapred.tasktracker.tasks.maximum value

+ 1 - 1
build.xml

@@ -26,7 +26,7 @@
  
   <property name="Name" value="Hadoop"/>
   <property name="name" value="hadoop"/>
-  <property name="version" value="0.16.0-dev"/>
+  <property name="version" value="0.17.0-dev"/>
   <property name="final.name" value="${name}-${version}"/>
   <property name="year" value="2006"/>
   <property name="libhdfs.version" value="1"/>

+ 13 - 13
docs/cluster_setup.html

@@ -210,7 +210,7 @@ document.write("Last Published: " + document.lastModified);
 </div>
   
     
-<a name="N1000C"></a><a name="Purpose"></a>
+<a name="N1000D"></a><a name="Purpose"></a>
 <h2 class="h3">Purpose</h2>
 <div class="section">
 <p>This document describes how to install, configure and manage non-trivial
@@ -222,7 +222,7 @@ document.write("Last Published: " + document.lastModified);
 </div>
     
     
-<a name="N1001D"></a><a name="Pre-requisites"></a>
+<a name="N1001E"></a><a name="Pre-requisites"></a>
 <h2 class="h3">Pre-requisites</h2>
 <div class="section">
 <ol>
@@ -241,7 +241,7 @@ document.write("Last Published: " + document.lastModified);
 </div>
     
     
-<a name="N10035"></a><a name="Installation"></a>
+<a name="N10036"></a><a name="Installation"></a>
 <h2 class="h3">Installation</h2>
 <div class="section">
 <p>Installing a Hadoop cluster typically involves unpacking the software 
@@ -257,11 +257,11 @@ document.write("Last Published: " + document.lastModified);
 </div>
     
     
-<a name="N10060"></a><a name="Configuration"></a>
+<a name="N10061"></a><a name="Configuration"></a>
 <h2 class="h3">Configuration</h2>
 <div class="section">
 <p>The following sections describe how to configure a Hadoop cluster.</p>
-<a name="N10069"></a><a name="Configuration+Files"></a>
+<a name="N1006A"></a><a name="Configuration+Files"></a>
 <h3 class="h4">Configuration Files</h3>
 <p>Hadoop configuration is driven by two important configuration files
         found in the <span class="codefrag">conf/</span> directory of the distribution:</p>
@@ -285,14 +285,14 @@ document.write("Last Published: " + document.lastModified);
 <p>Additionally, you can control the Hadoop scripts found in the 
         <span class="codefrag">bin/</span> directory of the distribution, by setting site-specific 
         values via the <span class="codefrag">conf/hadoop-env.sh</span>.</p>
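
In code, both files are read through org.apache.hadoop.conf.Configuration, which loads hadoop-default.xml first and then applies the site-specific overrides from hadoop-site.xml. A minimal sketch:

    import org.apache.hadoop.conf.Configuration;

    public class ShowConf {
      public static void main(String[] args) {
        // loads hadoop-default.xml, then hadoop-site.xml, from the classpath
        Configuration conf = new Configuration();
        System.out.println(conf.get("fs.default.name", "local"));
      }
    }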
-<a name="N10096"></a><a name="Site+Configuration"></a>
+<a name="N10097"></a><a name="Site+Configuration"></a>
 <h3 class="h4">Site Configuration</h3>
 <p>To configure the Hadoop cluster you will need to configure the
         <em>environment</em> in which the Hadoop daemons execute as well as
         the <em>configuration parameters</em> for the Hadoop daemons.</p>
 <p>The Hadoop daemons are <span class="codefrag">NameNode</span>/<span class="codefrag">DataNode</span> 
         and <span class="codefrag">JobTracker</span>/<span class="codefrag">TaskTracker</span>.</p>
-<a name="N100B4"></a><a name="Configuring+the+Environment+of+the+Hadoop+Daemons"></a>
+<a name="N100B5"></a><a name="Configuring+the+Environment+of+the+Hadoop+Daemons"></a>
 <h4>Configuring the Environment of the Hadoop Daemons</h4>
 <p>Administrators should use the <span class="codefrag">conf/hadoop-env.sh</span> script
           to do site-specific customization of the Hadoop daemons' process 
@@ -318,7 +318,7 @@ document.write("Last Published: " + document.lastModified);
             </li>
           
 </ul>
-<a name="N100DC"></a><a name="Configuring+the+Hadoop+Daemons"></a>
+<a name="N100DD"></a><a name="Configuring+the+Hadoop+Daemons"></a>
 <h4>Configuring the Hadoop Daemons</h4>
 <p>This section deals with important parameters to be specified in the
           <span class="codefrag">conf/hadoop-site.xml</span> for the Hadoop cluster.</p>
@@ -442,7 +442,7 @@ document.write("Last Published: " + document.lastModified);
           <a href="api/org/apache/hadoop/conf/Configuration.html#FinalParams">
          final</a> to ensure that they cannot be overridden by user-applications.
           </p>
-<a name="N101BC"></a><a name="Real-World+Cluster+Configurations"></a>
+<a name="N101BD"></a><a name="Real-World+Cluster+Configurations"></a>
 <h5>Real-World Cluster Configurations</h5>
 <p>This section lists some non-default configuration parameters which 
             have been used to run the <em>sort</em> benchmark on very large 
@@ -603,7 +603,7 @@ document.write("Last Published: " + document.lastModified);
 </li>
             
 </ul>
-<a name="N102D9"></a><a name="Slaves"></a>
+<a name="N102DA"></a><a name="Slaves"></a>
 <h4>Slaves</h4>
 <p>Typically you choose one machine in the cluster to act as the 
           <span class="codefrag">NameNode</span> and one machine as to act as the 
@@ -612,7 +612,7 @@ document.write("Last Published: " + document.lastModified);
           referred to as <em>slaves</em>.</p>
 <p>List all slave hostnames or IP addresses in your 
           <span class="codefrag">conf/slaves</span> file, one per line.</p>
-<a name="N102F8"></a><a name="Logging"></a>
+<a name="N102F9"></a><a name="Logging"></a>
 <h4>Logging</h4>
 <p>Hadoop uses the <a href="http://logging.apache.org/log4j/">Apache 
           log4j</a> via the <a href="http://commons.apache.org/logging/">Apache 
@@ -625,7 +625,7 @@ document.write("Last Published: " + document.lastModified);
 </div>
     
     
-<a name="N10318"></a><a name="Hadoop+Startup"></a>
+<a name="N10319"></a><a name="Hadoop+Startup"></a>
 <h2 class="h3">Hadoop Startup</h2>
 <div class="section">
 <p>To start a Hadoop cluster you will need to start both the HDFS and 
@@ -660,7 +660,7 @@ document.write("Last Published: " + document.lastModified);
 </div>
     
     
-<a name="N1035E"></a><a name="Hadoop+Shutdown"></a>
+<a name="N1035F"></a><a name="Hadoop+Shutdown"></a>
 <h2 class="h3">Hadoop Shutdown</h2>
 <div class="section">
 <p>

+ 34 - 34
docs/hdfs_design.html

@@ -287,7 +287,7 @@ document.write("Last Published: " + document.lastModified);
 </ul>
 </div>
     
-<a name="N10013"></a><a name="Introduction"></a>
+<a name="N10014"></a><a name="Introduction"></a>
 <h2 class="h3"> Introduction </h2>
 <div class="section">
 <p>
@@ -296,35 +296,35 @@ document.write("Last Published: " + document.lastModified);
 </div>
 
     
-<a name="N10025"></a><a name="Assumptions+and+Goals"></a>
+<a name="N10026"></a><a name="Assumptions+and+Goals"></a>
 <h2 class="h3"> Assumptions and Goals </h2>
 <div class="section">
-<a name="N1002B"></a><a name="Hardware+Failure"></a>
+<a name="N1002C"></a><a name="Hardware+Failure"></a>
 <h3 class="h4"> Hardware Failure </h3>
 <p>
         Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system&rsquo;s data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
        </p>
-<a name="N10035"></a><a name="Streaming+Data+Access"></a>
+<a name="N10036"></a><a name="Streaming+Data+Access"></a>
 <h3 class="h4"> Streaming Data Access </h3>
 <p>
         Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas has been traded to increase data throughput rates. 
         </p>
-<a name="N1003F"></a><a name="Large+Data+Sets"></a>
+<a name="N10040"></a><a name="Large+Data+Sets"></a>
 <h3 class="h4"> Large Data Sets </h3>
 <p>
         Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.
         </p>
-<a name="N10049"></a><a name="Simple+Coherency+Model"></a>
+<a name="N1004A"></a><a name="Simple+Coherency+Model"></a>
 <h3 class="h4"> Simple Coherency Model </h3>
 <p>
         HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. A MapReduce application or a web crawler application fits perfectly with this model. There is a plan to support appending-writes to files in the future. 
         </p>
-<a name="N10053"></a><a name="%E2%80%9CMoving+Computation+is+Cheaper+than+Moving+Data%E2%80%9D"></a>
+<a name="N10054"></a><a name="%E2%80%9CMoving+Computation+is+Cheaper+than+Moving+Data%E2%80%9D"></a>
 <h3 class="h4"> &ldquo;Moving Computation is Cheaper than Moving Data&rdquo; </h3>
 <p>
         A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located. 
         </p>
-<a name="N1005D"></a><a name="Portability+Across+Heterogeneous+Hardware+and+Software+Platforms"></a>
+<a name="N1005E"></a><a name="Portability+Across+Heterogeneous+Hardware+and+Software+Platforms"></a>
 <h3 class="h4"> Portability Across Heterogeneous Hardware and Software Platforms </h3>
 <p>
         HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications. 
@@ -333,7 +333,7 @@ document.write("Last Published: " + document.lastModified);
 
  
     
-<a name="N10068"></a><a name="Namenode+and+Datanodes"></a>
+<a name="N10069"></a><a name="Namenode+and+Datanodes"></a>
 <h2 class="h3"> Namenode and Datanodes </h2>
 <div class="section">
 <p>
@@ -352,7 +352,7 @@ document.write("Last Published: " + document.lastModified);
  
 
     
-<a name="N10089"></a><a name="The+File+System+Namespace"></a>
+<a name="N1008A"></a><a name="The+File+System+Namespace"></a>
 <h2 class="h3"> The File System Namespace </h2>
 <div class="section">
 <p>
@@ -366,7 +366,7 @@ document.write("Last Published: " + document.lastModified);
  
 
     
-<a name="N10096"></a><a name="Data+Replication"></a>
+<a name="N10097"></a><a name="Data+Replication"></a>
 <h2 class="h3"> Data Replication </h2>
 <div class="section">
 <p>
@@ -377,7 +377,7 @@ document.write("Last Published: " + document.lastModified);
     </p>
 <div id="" style="text-align: center;">
 <img id="" class="figure" alt="HDFS Datanodes" src="images/hdfsdatanodes.gif"></div>
-<a name="N100AC"></a><a name="Replica+Placement%3A+The+First+Baby+Steps"></a>
+<a name="N100AD"></a><a name="Replica+Placement%3A+The+First+Baby+Steps"></a>
 <h3 class="h4"> Replica Placement: The First Baby Steps </h3>
 <p>
         The placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from most other distributed file systems. This is a feature that needs lots of tuning and experience. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current implementation for the replica placement policy is a first effort in this direction. The short-term goals of implementing this policy are to validate it on production systems, learn more about its behavior, and build a foundation to test and research more sophisticated policies. 
@@ -394,12 +394,12 @@ document.write("Last Published: " + document.lastModified);
 <p>
         The current, default replica placement policy described here is a work in progress.
         </p>
-<a name="N100C6"></a><a name="Replica+Selection"></a>
+<a name="N100C7"></a><a name="Replica+Selection"></a>
 <h3 class="h4"> Replica Selection </h3>
 <p>
        To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader. If there exists a replica on the same rack as the reader node, then that replica is preferred to satisfy the read request. If an HDFS cluster spans multiple data centers, then a replica that is resident in the local data center is preferred over any remote replica.
         </p>
-<a name="N100D0"></a><a name="SafeMode"></a>
+<a name="N100D1"></a><a name="SafeMode"></a>
 <h3 class="h4"> SafeMode </h3>
 <p>
         On startup, the Namenode enters a special state called <em>Safemode</em>. Replication of data blocks does not occur when the Namenode is in the Safemode state. The Namenode receives Heartbeat and Blockreport messages from the Datanodes. A Blockreport contains the list of data blocks that a Datanode is hosting. Each block has a specified minimum number of replicas. A block is considered <em>safely replicated</em> when the minimum number of replicas of that data block has checked in with the Namenode. After a configurable percentage of safely replicated data blocks checks in with the Namenode (plus an additional 30 seconds), the Namenode exits the Safemode state. It then determines the list of data blocks (if any) that still have fewer than the specified number of replicas. The Namenode then replicates these blocks to other Datanodes.
@@ -407,7 +407,7 @@ document.write("Last Published: " + document.lastModified);
 </div>
 
     
-<a name="N100E1"></a><a name="The+Persistence+of+File+System+Metadata"></a>
+<a name="N100E2"></a><a name="The+Persistence+of+File+System+Metadata"></a>
 <h2 class="h3"> The Persistence of File System Metadata </h2>
 <div class="section">
 <p>
@@ -423,7 +423,7 @@ document.write("Last Published: " + document.lastModified);
 
 
     
-<a name="N10103"></a><a name="The+Communication+Protocols"></a>
+<a name="N10104"></a><a name="The+Communication+Protocols"></a>
 <h2 class="h3"> The Communication Protocols </h2>
 <div class="section">
 <p>
@@ -433,29 +433,29 @@ document.write("Last Published: " + document.lastModified);
  
 
     
-<a name="N1011B"></a><a name="Robustness"></a>
+<a name="N1011C"></a><a name="Robustness"></a>
 <h2 class="h3"> Robustness </h2>
 <div class="section">
 <p>
       The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failures are Namenode failures, Datanode failures and network partitions.
       </p>
-<a name="N10124"></a><a name="Data+Disk+Failure%2C+Heartbeats+and+Re-Replication"></a>
+<a name="N10125"></a><a name="Data+Disk+Failure%2C+Heartbeats+and+Re-Replication"></a>
 <h3 class="h4"> Data Disk Failure, Heartbeats and Re-Replication </h3>
 <p>
         Each Datanode sends a Heartbeat message to the Namenode periodically. A network partition can cause a subset of Datanodes to lose connectivity with the Namenode. The Namenode detects this condition by the absence of a Heartbeat message. The Namenode marks Datanodes without recent Heartbeats as dead and does not forward any new <acronym title="Input/Output">IO</acronym> requests to them. Any data that was registered to a dead Datanode is not available to HDFS any more. Datanode death may cause the replication factor of some blocks to fall below their specified value. The Namenode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise due to many reasons: a Datanode may become unavailable, a replica may become corrupted, a hard disk on a Datanode may fail, or the replication factor of a file may be increased. 
         </p>
-<a name="N10132"></a><a name="Cluster+Rebalancing"></a>
+<a name="N10133"></a><a name="Cluster+Rebalancing"></a>
 <h3 class="h4"> Cluster Rebalancing </h3>
 <p>
         The HDFS architecture is compatible with <em>data rebalancing schemes</em>. A scheme might automatically move data from one Datanode to another if the free space on a Datanode falls below a certain threshold. In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster. These types of data rebalancing schemes are not yet implemented. 
         </p>
-<a name="N1013F"></a><a name="Data+Integrity"></a>
+<a name="N10140"></a><a name="Data+Integrity"></a>
 <h3 class="h4"> Data Integrity </h3>
 <p>
         <!-- XXX "checksum checking" sounds funny -->
         It is possible that a block of data fetched from a Datanode arrives corrupted. This corruption can occur because of faults in a storage device, network faults, or buggy software. The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents it verifies that the data it received from each Datanode matches the checksum stored in the associated checksum file. If not, then the client can opt to retrieve that block from another Datanode that has a replica of that block.
         </p>
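
The technique can be illustrated with a per-chunk CRC32 in plain Java; this is a sketch of the idea, not the HDFS implementation (HDFS checksums fixed-size chunks, io.bytes.per.checksum bytes, 512 by default in this era):

    import java.util.zip.CRC32;

    public class ChunkChecksums {
      // one CRC32 per fixed-size chunk, stored separately and re-verified on read
      static long[] checksums(byte[] block, int bytesPerChecksum) {
        int chunks = (block.length + bytesPerChecksum - 1) / bytesPerChecksum;
        long[] sums = new long[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
          crc.reset();
          int off = i * bytesPerChecksum;
          int len = Math.min(bytesPerChecksum, block.length - off);
          crc.update(block, off, len);
          sums[i] = crc.getValue();
        }
        return sums;
      }
    }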
-<a name="N1014B"></a><a name="Metadata+Disk+Failure"></a>
+<a name="N1014C"></a><a name="Metadata+Disk+Failure"></a>
 <h3 class="h4"> Metadata Disk Failure </h3>
 <p>
         The FsImage and the EditLog are central data structures of HDFS. A corruption of these files can cause the HDFS instance to be non-functional. For this reason, the Namenode can be configured to support maintaining multiple copies of the FsImage and EditLog. Any update to either the FsImage or EditLog causes each of the FsImages and EditLogs to get updated synchronously. This synchronous updating of multiple copies of the FsImage and EditLog may degrade the rate of namespace transactions per second that a Namenode can support. However, this degradation is acceptable because even though HDFS applications are very <em>data</em> intensive in nature, they are not <em>metadata</em> intensive. When a Namenode restarts, it selects the latest consistent FsImage and EditLog to use.
@@ -463,7 +463,7 @@ document.write("Last Published: " + document.lastModified);
 <p> 
         The Namenode machine is a single point of failure for an HDFS cluster. If the Namenode machine fails, manual intervention is necessary. Currently, automatic restart and failover of the Namenode software to another machine is not supported.
         </p>
-<a name="N1015E"></a><a name="Snapshots"></a>
+<a name="N1015F"></a><a name="Snapshots"></a>
 <h3 class="h4"> Snapshots </h3>
 <p>
         Snapshots support storing a copy of data at a particular instant of time. One usage of the snapshot feature may be to roll back a corrupted HDFS instance to a previously known good point in time. HDFS does not currently support snapshots but will in a future release.
@@ -472,15 +472,15 @@ document.write("Last Published: " + document.lastModified);
  
 
     
-<a name="N10169"></a><a name="Data+Organization"></a>
+<a name="N1016A"></a><a name="Data+Organization"></a>
 <h2 class="h3"> Data Organization </h2>
 <div class="section">
-<a name="N10171"></a><a name="Data+Blocks"></a>
+<a name="N10172"></a><a name="Data+Blocks"></a>
 <h3 class="h4"> Data Blocks </h3>
 <p>
         HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a different Datanode.
         </p>
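
The arithmetic is direct; for example, a 1 GB file at the default 64 MB block size occupies 16 blocks:

    // ceil(fileSize / blockSize)
    long blockSize = 64L * 1024 * 1024;
    long fileSize  = 1024L * 1024 * 1024;
    long numBlocks = (fileSize + blockSize - 1) / blockSize;  // = 16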
-<a name="N1017B"></a><a name="Staging"></a>
+<a name="N1017C"></a><a name="Staging"></a>
 <h3 class="h4"> Staging </h3>
 <p>
         A client request to create a file does not reach the Namenode immediately. In fact, initially the HDFS client caches the file data into a temporary local file. Application writes are transparently redirected to this temporary local file. When the local file accumulates data worth over one HDFS block size, the client contacts the Namenode. The Namenode inserts the file name into the file system hierarchy and allocates a data block for it. The Namenode responds to the client request with the identity of the Datanode and the destination data block. Then the client flushes the block of data from the local temporary file to the specified Datanode. When a file is closed, the remaining un-flushed data in the temporary local file is transferred to the Datanode. The client then tells the Namenode that the file is closed. At this point, the Namenode commits the file creation operation into a persistent store. If the Namenode dies before the file is closed, the file is lost. 
@@ -488,7 +488,7 @@ document.write("Last Published: " + document.lastModified);
 <p>
         The above approach has been adopted after careful consideration of target applications that run on HDFS. These applications need streaming writes to files. If a client writes to a remote file directly without any client side buffering, the network speed and the congestion in the network impacts throughput considerably. This approach is not without precedent. Earlier distributed file systems, e.g. <acronym title="Andrew File System">AFS</acronym>, have used client side caching to improve performance. A POSIX requirement has been relaxed to achieve higher performance of data uploads. 
         </p>
-<a name="N1018E"></a><a name="Replication+Pipelining"></a>
+<a name="N1018F"></a><a name="Replication+Pipelining"></a>
 <h3 class="h4"> Replication Pipelining </h3>
 <p>
         When a client is writing data to an HDFS file, its data is first written to a local file as explained in the previous section. Suppose the HDFS file has a replication factor of three. When the local file accumulates a full block of user data, the client retrieves a list of Datanodes from the Namenode. This list contains the Datanodes that will host a replica of that block. The client then flushes the data block to the first Datanode. The first Datanode starts receiving the data in small portions (4 KB), writes each portion to its local repository and transfers that portion to the second Datanode in the list. The second Datanode, in turn starts receiving each portion of the data block, writes that portion to its repository and then flushes that portion to the third Datanode. Finally, the third Datanode writes the data to its local repository. Thus, a Datanode can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline. Thus, the data is pipelined from one Datanode to the next.
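
Stripped to essentials, each Datanode's forwarding loop can be sketched as below; a sketch of the technique, not the actual Datanode code:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    public class PipelineRelay {
      // receive a block in 4 KB portions, write each portion to the local
      // repository, and forward it to the next Datanode (null for the last one)
      static void relay(InputStream upstream, OutputStream localRepo,
                        OutputStream next) throws IOException {
        byte[] portion = new byte[4096];
        int n;
        while ((n = upstream.read(portion)) != -1) {
          localRepo.write(portion, 0, n);
          if (next != null) {
            next.write(portion, 0, n);
          }
        }
      }
    }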
@@ -496,13 +496,13 @@ document.write("Last Published: " + document.lastModified);
 </div>
 
     
-<a name="N10199"></a><a name="Accessibility"></a>
+<a name="N1019A"></a><a name="Accessibility"></a>
 <h2 class="h3"> Accessibility </h2>
 <div class="section">
 <p>
       HDFS can be accessed from applications in many different ways. Natively, HDFS provides a <a href="http://hadoop.apache.org/core/docs/current/api/">Java API</a> for applications to use. A C language wrapper for this Java API is also available. In addition, an HTTP browser can also be used to browse the files of an HDFS instance. Work is in progress to expose HDFS through the <acronym title="Web-based Distributed Authoring and Versioning">WebDAV</acronym> protocol. 
       </p>
-<a name="N101AE"></a><a name="DFSShell"></a>
+<a name="N101AF"></a><a name="DFSShell"></a>
 <h3 class="h4"> DFSShell </h3>
 <p>
         HDFS allows user data to be organized in the form of files and directories. It provides a commandline interface called <em>DFSShell</em> that lets a user interact with the data in HDFS. The syntax of this command set is similar to other shells (e.g. bash, csh) that users are already familiar with. Here are some sample action/command pairs:
@@ -537,7 +537,7 @@ document.write("Last Published: " + document.lastModified);
 <p>
         DFSShell is targeted for applications that need a scripting language to interact with the stored data.
         </p>
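
The same actions are available to programs through the Java API; a minimal sketch using org.apache.hadoop.fs.FileSystem (the path is illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DfsMkdir {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        fs.mkdirs(new Path("/foodir"));                // like: dfs -mkdir /foodir
        System.out.println(fs.exists(new Path("/foodir")));
      }
    }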
-<a name="N10206"></a><a name="DFSAdmin"></a>
+<a name="N10207"></a><a name="DFSAdmin"></a>
 <h3 class="h4"> DFSAdmin </h3>
 <p>
         The <em>DFSAdmin</em> command set is used for administering an HDFS cluster. These are commands that are used only by an HDFS administrator. Here are some sample action/command pairs:
@@ -569,7 +569,7 @@ document.write("Last Published: " + document.lastModified);
 </tr>
         
 </table>
-<a name="N10254"></a><a name="Browser+Interface"></a>
+<a name="N10255"></a><a name="Browser+Interface"></a>
 <h3 class="h4"> Browser Interface </h3>
 <p>
         A typical HDFS install configures a web server to expose the HDFS namespace through a configurable TCP port. This allows a user to navigate the HDFS namespace and view the contents of its files using a web browser.
@@ -577,10 +577,10 @@ document.write("Last Published: " + document.lastModified);
 </div> 
 
     
-<a name="N1025F"></a><a name="Space+Reclamation"></a>
+<a name="N10260"></a><a name="Space+Reclamation"></a>
 <h2 class="h3"> Space Reclamation </h2>
 <div class="section">
-<a name="N10265"></a><a name="File+Deletes+and+Undeletes"></a>
+<a name="N10266"></a><a name="File+Deletes+and+Undeletes"></a>
 <h3 class="h4"> File Deletes and Undeletes </h3>
 <p>
         When a file is deleted by a user or an application, it is not immediately removed from HDFS.  Instead, HDFS first renames it to a file in the <span class="codefrag">/trash</span> directory. The file can be restored quickly as long as it remains in <span class="codefrag">/trash</span>. A file remains in <span class="codefrag">/trash</span> for a configurable amount of time. After the expiry of its life in <span class="codefrag">/trash</span>, the Namenode deletes the file from the HDFS namespace. The deletion of a file causes the blocks associated with the file to be freed. Note that there could be an appreciable time delay between the time a file is deleted by a user and the time of the corresponding increase in free space in HDFS.
@@ -588,7 +588,7 @@ document.write("Last Published: " + document.lastModified);
 <p>
         A user can Undelete a file after deleting it as long as it remains in the <span class="codefrag">/trash</span> directory. If a user wants to undelete a file that he/she has deleted, he/she can navigate the <span class="codefrag">/trash</span> directory and retrieve the file. The <span class="codefrag">/trash</span> directory contains only the latest copy of the file that was deleted. The <span class="codefrag">/trash</span> directory is just like any other directory with one special feature: HDFS applies specified policies to automatically delete files from this directory. The current default policy is to delete files from <span class="codefrag">/trash</span> that are more than 6 hours old. In the future, this policy will be configurable through a well defined interface.
         </p>
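
Programmatic undelete is then just a rename out of /trash before expiry; a hypothetical sketch (the exact /trash layout is an assumption here):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class Undelete {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // source path is illustrative; inspect /trash for the real location
        fs.rename(new Path("/trash/myfile.txt"),
                  new Path("/user/me/myfile.txt"));
      }
    }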
-<a name="N1028D"></a><a name="Decrease+Replication+Factor"></a>
+<a name="N1028E"></a><a name="Decrease+Replication+Factor"></a>
 <h3 class="h4"> Decrease Replication Factor </h3>
 <p>
         When the replication factor of a file is reduced, the Namenode selects excess replicas that can be deleted. The next Heartbeat transfers this information to the Datanode. The Datanode then removes the corresponding blocks and the corresponding free space appears in the cluster. Once again, there might be a time delay between the completion of the <span class="codefrag">setReplication</span> API call and the appearance of free space in the cluster.
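
The API call named above, in a minimal sketch:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LowerReplication {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // drop to 2 replicas; excess replicas are removed via later Heartbeats
        fs.setReplication(new Path("/user/me/data.seq"), (short) 2);
      }
    }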
@@ -597,7 +597,7 @@ document.write("Last Published: " + document.lastModified);
 
 
     
-<a name="N1029B"></a><a name="References"></a>
+<a name="N1029C"></a><a name="References"></a>
 <h2 class="h3"> References </h2>
 <div class="section">
 <p>

+ 15 - 15
docs/hdfs_user_guide.html

@@ -220,7 +220,7 @@ document.write("Last Published: " + document.lastModified);
 </ul>
 </div>
     
-<a name="N1000C"></a><a name="Purpose"></a>
+<a name="N1000D"></a><a name="Purpose"></a>
 <h2 class="h3">Purpose</h2>
 <div class="section">
 <p>
@@ -235,7 +235,7 @@ document.write("Last Published: " + document.lastModified);
 </div>
 
     
-<a name="N1001A"></a><a name="Overview"></a>
+<a name="N1001B"></a><a name="Overview"></a>
 <h2 class="h3"> Overview </h2>
 <div class="section">
 <p>
@@ -341,7 +341,7 @@ document.write("Last Published: " + document.lastModified);
     
 </ul>
 </div> 
-<a name="N10082"></a><a name="Pre-requisites"></a>
+<a name="N10083"></a><a name="Pre-requisites"></a>
 <h2 class="h3"> Pre-requisites </h2>
 <div class="section">
 <p>
@@ -370,7 +370,7 @@ document.write("Last Published: " + document.lastModified);
  	machine.	
     </p>
 </div> 
-<a name="N100A0"></a><a name="Web+Interface"></a>
+<a name="N100A1"></a><a name="Web+Interface"></a>
 <h2 class="h3"> Web Interface </h2>
 <div class="section">
 <p>
@@ -384,7 +384,7 @@ document.write("Last Published: " + document.lastModified);
  	page).
  </p>
 </div> 
-<a name="N100AD"></a><a name="Shell+Commands"></a>
+<a name="N100AE"></a><a name="Shell+Commands"></a>
 <h2 class="h3">Shell Commands</h2>
 <div class="section">
 <p>
@@ -400,7 +400,7 @@ document.write("Last Published: " + document.lastModified);
       changing file permissions, etc. It also supports a few HDFS
       specific operations like changing replication of files.
      </p>
-<a name="N100BC"></a><a name="DFSAdmin+Command"></a>
+<a name="N100BD"></a><a name="DFSAdmin+Command"></a>
 <h3 class="h4"> DFSAdmin Command </h3>
 <p>
    	
@@ -433,7 +433,7 @@ document.write("Last Published: " + document.lastModified);
    	
 </ul>
 </div> 
-<a name="N100E5"></a><a name="Secondary+Namenode"></a>
+<a name="N100E6"></a><a name="Secondary+Namenode"></a>
 <h2 class="h3"> Secondary Namenode </h2>
 <div class="section">
 <p>
@@ -458,7 +458,7 @@ document.write("Last Published: " + document.lastModified);
      specified in <span class="codefrag">conf/masters</span> file.
    </p>
 </div> 
-<a name="N1010A"></a><a name="Rebalancer"></a>
+<a name="N1010B"></a><a name="Rebalancer"></a>
 <h2 class="h3"> Rebalancer </h2>
 <div class="section">
 <p>
@@ -503,7 +503,7 @@ document.write("Last Published: " + document.lastModified);
       <a href="http://issues.apache.org/jira/browse/HADOOP-1652">HADOOP-1652</a>.
     </p>
 </div> 
-<a name="N10131"></a><a name="Rack+Awareness"></a>
+<a name="N10132"></a><a name="Rack+Awareness"></a>
 <h2 class="h3"> Rack Awareness </h2>
 <div class="section">
 <p>
@@ -522,7 +522,7 @@ document.write("Last Published: " + document.lastModified);
       <a href="http://issues.apache.org/jira/browse/HADOOP-692">HADOOP-692</a>.
     </p>
 </div> 
-<a name="N1014F"></a><a name="Safemode"></a>
+<a name="N10150"></a><a name="Safemode"></a>
 <h2 class="h3"> Safemode </h2>
 <div class="section">
 <p>
@@ -542,7 +542,7 @@ document.write("Last Published: " + document.lastModified);
       <a href="http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/dfs/NameNode.html#setSafeMode(org.apache.hadoop.dfs.FSConstants.SafeModeAction)"><span class="codefrag">setSafeMode()</span></a>.
     </p>
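
From a client, a sketch assuming the 0.16-era org.apache.hadoop.dfs classes (DistributedFileSystem.setSafeMode mirroring the dfsadmin command is an assumption here):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.dfs.DistributedFileSystem;
    import org.apache.hadoop.dfs.FSConstants;
    import org.apache.hadoop.fs.FileSystem;

    public class SafemodeQuery {
      public static void main(String[] args) throws Exception {
        DistributedFileSystem dfs =
            (DistributedFileSystem) FileSystem.get(new Configuration());
        // SAFEMODE_GET only queries the current state
        boolean on = dfs.setSafeMode(FSConstants.SafeModeAction.SAFEMODE_GET);
        System.out.println("Safe mode is " + (on ? "ON" : "OFF"));
      }
    }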
 </div> 
-<a name="N1016D"></a><a name="Fsck"></a>
+<a name="N1016E"></a><a name="Fsck"></a>
 <h2 class="h3"> Fsck </h2>
 <div class="section">
 <p>    
@@ -558,7 +558,7 @@ document.write("Last Published: " + document.lastModified);
       Fsck can be run on the whole filesystem or on a subset of files.
      </p>
 </div> 
-<a name="N1017D"></a><a name="Upgrade+and+Rollback"></a>
+<a name="N1017E"></a><a name="Upgrade+and+Rollback"></a>
 <h2 class="h3"> Upgrade and Rollback </h2>
 <div class="section">
 <p>
@@ -617,7 +617,7 @@ document.write("Last Published: " + document.lastModified);
       
 </ul>
 </div> 
-<a name="N101BE"></a><a name="File+Permissions+and+Security"></a>
+<a name="N101BF"></a><a name="File+Permissions+and+Security"></a>
 <h2 class="h3"> File Permissions and Security </h2>
 <div class="section">
 <p>           
@@ -629,7 +629,7 @@ document.write("Last Published: " + document.lastModified);
       authentication and encryption of data transfers.
      </p>
 </div> 
-<a name="N101CB"></a><a name="Scalability"></a>
+<a name="N101CC"></a><a name="Scalability"></a>
 <h2 class="h3"> Scalability </h2>
 <div class="section">
 <p>
@@ -647,7 +647,7 @@ document.write("Last Published: " + document.lastModified);
       suggested configuration improvements for large Hadoop clusters.
      </p>
 </div> 
-<a name="N101DD"></a><a name="Related+Documentation"></a>
+<a name="N101DE"></a><a name="Related+Documentation"></a>
 <h2 class="h3"> Related Documentation </h2>
 <div class="section">
 <p>

+ 35 - 35
docs/hod.html

@@ -294,7 +294,7 @@ document.write("Last Published: " + document.lastModified);
 </ul>
 </div>
     
-<a name="N1000C"></a><a name="Introduction"></a>
+<a name="N1000D"></a><a name="Introduction"></a>
 <h2 class="h3"> Introduction </h2>
 <div class="section">
 <p>
@@ -303,30 +303,30 @@ document.write("Last Published: " + document.lastModified);
 </div>
 
     
-<a name="N1001E"></a><a name="Feature+List"></a>
+<a name="N1001F"></a><a name="Feature+List"></a>
 <h2 class="h3"> Feature List </h2>
 <div class="section">
-<a name="N10024"></a><a name="Simplified+Interface+for+Provisioning+Hadoop+Clusters"></a>
+<a name="N10025"></a><a name="Simplified+Interface+for+Provisioning+Hadoop+Clusters"></a>
 <h3 class="h4"> Simplified Interface for Provisioning Hadoop Clusters </h3>
 <p>
         By far, the biggest advantage of HOD is to quickly setup a Hadoop cluster. The user interacts with the cluster through a simple command line interface, the HOD client. HOD brings up a virtual MapReduce cluster with the required number of nodes, which the user can use for running Hadoop jobs. When done, HOD will automatically clean up the resources and make the nodes available again.
         </p>
-<a name="N1002E"></a><a name="Automatic+installation+of+Hadoop"></a>
+<a name="N1002F"></a><a name="Automatic+installation+of+Hadoop"></a>
 <h3 class="h4"> Automatic installation of Hadoop </h3>
 <p>
         With HOD, Hadoop does not need to be even installed on the cluster. The user can provide a Hadoop tarball that HOD will automatically distribute to all the nodes in the cluster.
         </p>
-<a name="N10038"></a><a name="Configuring+Hadoop"></a>
+<a name="N10039"></a><a name="Configuring+Hadoop"></a>
 <h3 class="h4"> Configuring Hadoop </h3>
 <p>
        Dynamic parameters of Hadoop configuration, such as the NameNode and JobTracker addresses and ports, and file system temporary directories are generated and distributed by HOD automatically to all nodes in the cluster. In addition, HOD allows the user to configure Hadoop parameters at both the server (e.g. JobTracker) and client (e.g. JobClient) level, including 'final' parameters, which were introduced with Hadoop 0.15.
         </p>
-<a name="N10042"></a><a name="Auto-cleanup+of+Unused+Clusters"></a>
+<a name="N10043"></a><a name="Auto-cleanup+of+Unused+Clusters"></a>
 <h3 class="h4"> Auto-cleanup of Unused Clusters </h3>
 <p>
         HOD has an automatic timeout so that users cannot misuse resources they aren't using. The timeout applies only when there is no MapReduce job running. 
         </p>
-<a name="N1004C"></a><a name="Log+Services"></a>
+<a name="N1004D"></a><a name="Log+Services"></a>
 <h3 class="h4"> Log Services </h3>
 <p>
         HOD can be used to collect all MapReduce logs to a central location for archiving and inspection after the job is completed.
@@ -334,13 +334,13 @@ document.write("Last Published: " + document.lastModified);
 </div>
 
     
-<a name="N10057"></a><a name="HOD+Components"></a>
+<a name="N10058"></a><a name="HOD+Components"></a>
 <h2 class="h3"> HOD Components </h2>
 <div class="section">
 <p>
       This is a brief overview of the various components of HOD and how they interact to provision Hadoop.
       </p>
-<a name="N10060"></a><a name="HOD+Client"></a>
+<a name="N10061"></a><a name="HOD+Client"></a>
 <h3 class="h4"> HOD Client </h3>
 <p>
         The HOD client is a Unix command that users use to allocate Hadoop MapReduce clusters. The command provides other options to list allocated clusters and deallocate them. The HOD client generates the <em>hadoop-site.xml</em> in a user specified directory. The user can point to this configuration file while running Map/Reduce jobs on the allocated cluster.
@@ -348,7 +348,7 @@ document.write("Last Published: " + document.lastModified);
 <p>
         The nodes from where the HOD Client is run are called <em>submit nodes</em> because jobs are submitted to the resource manager system for allocating and running clusters from these nodes.
         </p>
-<a name="N10073"></a><a name="RingMaster"></a>
+<a name="N10074"></a><a name="RingMaster"></a>
 <h3 class="h4"> RingMaster </h3>
 <p>
         The RingMaster is a HOD process that is started on one node per every allocated cluster. It is submitted as a 'job' to the resource manager by the HOD client. It controls which Hadoop daemons start on which nodes. It provides this information to other HOD processes, such as the HOD client, so users can also determine this information. The RingMaster is responsible for hosting and distributing the Hadoop tarball to all nodes in the cluster. It also automatically cleans up unused clusters.
@@ -356,17 +356,17 @@ document.write("Last Published: " + document.lastModified);
 <p>
         
 </p>
-<a name="N10080"></a><a name="HodRing"></a>
+<a name="N10081"></a><a name="HodRing"></a>
 <h3 class="h4"> HodRing </h3>
 <p>
         The HodRing is a HOD process that runs on every allocated node in the cluster. These processes are run by the RingMaster through the resource manager, using a facility of parallel execution. The HodRings are responsible for launching Hadoop commands on the nodes to bring up the Hadoop daemons. They get the command to launch from the RingMaster.
         </p>
-<a name="N1008A"></a><a name="Hodrc+%2F+HOD+configuration+file"></a>
+<a name="N1008B"></a><a name="Hodrc+%2F+HOD+configuration+file"></a>
 <h3 class="h4"> Hodrc / HOD configuration file </h3>
 <p>
         An INI style configuration file where the users configure various options for the HOD system, including install locations of different software, resource manager parameters, log and temp file directories, parameters for their MapReduce jobs, etc.
         </p>
-<a name="N10094"></a><a name="Submit+Nodes+and+Compute+Nodes"></a>
+<a name="N10095"></a><a name="Submit+Nodes+and+Compute+Nodes"></a>
 <h3 class="h4"> Submit Nodes and Compute Nodes </h3>
 <p>
        The nodes from where the <em>HOD Client</em> is run are referred to as <em>submit nodes</em> because jobs are submitted to the resource manager system for allocating and running clusters from these nodes.
@@ -377,17 +377,17 @@ document.write("Last Published: " + document.lastModified);
 </div>
 
     
-<a name="N100AE"></a><a name="Getting+Started+with+HOD"></a>
+<a name="N100AF"></a><a name="Getting+Started+with+HOD"></a>
 <h2 class="h3"> Getting Started with HOD </h2>
 <div class="section">
-<a name="N100B4"></a><a name="Pre-Requisites"></a>
+<a name="N100B5"></a><a name="Pre-Requisites"></a>
 <h3 class="h4"> Pre-Requisites </h3>
-<a name="N100BA"></a><a name="Hardware"></a>
+<a name="N100BB"></a><a name="Hardware"></a>
 <h4> Hardware </h4>
 <p>
           HOD requires a minimum of 3 nodes configured through a resource manager.
           </p>
-<a name="N100C4"></a><a name="Software"></a>
+<a name="N100C5"></a><a name="Software"></a>
 <h4> Software </h4>
 <p>
           The following components are assumed to be installed before using HOD:
@@ -424,7 +424,7 @@ document.write("Last Published: " + document.lastModified);
 <p>
           HOD configuration requires the location of installs of these components to be the same on all nodes in the cluster. It will also make the configuration simpler to have the same location on the submit nodes.
           </p>
-<a name="N100FE"></a><a name="Resource+Manager+Configuration+Pre-requisites"></a>
+<a name="N100FF"></a><a name="Resource+Manager+Configuration+Pre-requisites"></a>
 <h4>Resource Manager Configuration Pre-requisites</h4>
 <p>
           For using HOD with Torque:
@@ -456,7 +456,7 @@ document.write("Last Published: " + document.lastModified);
           More information about setting up Torque can be found by referring to the documentation <a href="http://www.clusterresources.com/pages/products/torque-resource-manager.php">here.</a>
           
 </p>
-<a name="N10125"></a><a name="Setting+up+HOD"></a>
+<a name="N10126"></a><a name="Setting+up+HOD"></a>
 <h3 class="h4">Setting up HOD</h3>
 <ul>
           
@@ -550,15 +550,15 @@ document.write("Last Published: " + document.lastModified);
 </div>
 
     
-<a name="N101B3"></a><a name="Running+HOD"></a>
+<a name="N101B4"></a><a name="Running+HOD"></a>
 <h2 class="h3">Running HOD</h2>
 <div class="section">
-<a name="N101B9"></a><a name="Overview"></a>
+<a name="N101BA"></a><a name="Overview"></a>
 <h3 class="h4">Overview</h3>
 <p>
        A typical session of HOD will involve at least three steps: allocate, run Hadoop jobs, deallocate.
         </p>
-<a name="N101C2"></a><a name="Operation+allocate"></a>
+<a name="N101C3"></a><a name="Operation+allocate"></a>
 <h4>Operation allocate</h4>
 <p>
           The allocate operation is used to allocate a set of nodes and install and provision Hadoop on them. It has the following syntax:
@@ -605,7 +605,7 @@ document.write("Last Published: " + document.lastModified);
 </tr>
           
 </table>
-<a name="N10202"></a><a name="Running+Hadoop+jobs+using+the+allocated+cluster"></a>
+<a name="N10203"></a><a name="Running+Hadoop+jobs+using+the+allocated+cluster"></a>
 <h4>Running Hadoop jobs using the allocated cluster</h4>
 <p>
           Now, one can run Hadoop jobs using the allocated cluster in the usual manner:
@@ -631,7 +631,7 @@ document.write("Last Published: " + document.lastModified);
 </tr>
           
 </table>
-<a name="N10225"></a><a name="Operation+deallocate"></a>
+<a name="N10226"></a><a name="Operation+deallocate"></a>
 <h4>Operation deallocate</h4>
 <p>
           The deallocate operation is used to release an allocated cluster. When finished with a cluster, deallocate must be run so that the nodes become free for others to use. The deallocate operation has the following syntax:
@@ -657,7 +657,7 @@ document.write("Last Published: " + document.lastModified);
 </tr>
           
 </table>
-<a name="N10249"></a><a name="Command+Line+Options"></a>
+<a name="N1024A"></a><a name="Command+Line+Options"></a>
 <h3 class="h4">Command Line Options</h3>
 <p>
         This section covers the major command line options available via the hod command:
@@ -768,10 +768,10 @@ document.write("Last Published: " + document.lastModified);
         </p>
 </div>
     
-<a name="N102C9"></a><a name="HOD+Configuration"></a>
+<a name="N102CA"></a><a name="HOD+Configuration"></a>
 <h2 class="h3"> HOD Configuration </h2>
 <div class="section">
-<a name="N102CF"></a><a name="Introduction+to+HOD+Configuration"></a>
+<a name="N102D0"></a><a name="Introduction+to+HOD+Configuration"></a>
 <h3 class="h4"> Introduction to HOD Configuration </h3>
 <p>
         Configuration options for HOD are organized as sections and options within them. They can be specified in two ways: a configuration file in the INI format, and as command line options to the HOD shell, specified in the format --section.option[=value]. If the same option is specified in both places, the value specified on the command line overrides the value in the configuration file.
@@ -783,7 +783,7 @@ document.write("Last Published: " + document.lastModified);
 <p>
         This section explains some of the most important or commonly used configuration options in some more detail.
         </p>
-<a name="N102E2"></a><a name="Categories+%2F+Sections+in+HOD+Configuration"></a>
+<a name="N102E3"></a><a name="Categories+%2F+Sections+in+HOD+Configuration"></a>
 <h3 class="h4"> Categories / Sections in HOD Configuration </h3>
 <p>
         The following are the various sections in the HOD configuration:
@@ -840,9 +840,9 @@ document.write("Last Published: " + document.lastModified);
 </tr>
         
 </table>
-<a name="N1034A"></a><a name="Important+and+Commonly+Used+Configuration+Options"></a>
+<a name="N1034B"></a><a name="Important+and+Commonly+Used+Configuration+Options"></a>
 <h3 class="h4"> Important and Commonly Used Configuration Options </h3>
-<a name="N10350"></a><a name="Common+configuration+options"></a>
+<a name="N10351"></a><a name="Common+configuration+options"></a>
 <h4> Common configuration options </h4>
 <p>
           Certain configuration options are defined in most of the sections of the HOD configuration. Options defined in a section, are used by the process for which that section applies. These options have the same meaning, but can have different values in each section.
@@ -892,7 +892,7 @@ document.write("Last Published: " + document.lastModified);
 </tr>
           
 </table>
-<a name="N103AE"></a><a name="hod+options"></a>
+<a name="N103AF"></a><a name="hod+options"></a>
 <h4> hod options </h4>
 <table class="ForrestTable" cellspacing="1" cellpadding="4">
             
@@ -918,7 +918,7 @@ document.write("Last Published: " + document.lastModified);
 </tr>
           
 </table>
-<a name="N103DF"></a><a name="resource_manager+options"></a>
+<a name="N103E0"></a><a name="resource_manager+options"></a>
 <h4> resource_manager options </h4>
 <table class="ForrestTable" cellspacing="1" cellpadding="4">
             
@@ -951,7 +951,7 @@ document.write("Last Published: " + document.lastModified);
 </tr>
           
 </table>
-<a name="N1041D"></a><a name="ringmaster+options"></a>
+<a name="N1041E"></a><a name="ringmaster+options"></a>
 <h4> ringmaster options </h4>
 <table class="ForrestTable" cellspacing="1" cellpadding="4">
             
@@ -970,7 +970,7 @@ document.write("Last Published: " + document.lastModified);
 </tr>
           
 </table>
-<a name="N10441"></a><a name="gridservice-hdfs+options"></a>
+<a name="N10442"></a><a name="gridservice-hdfs+options"></a>
 <h4> gridservice-hdfs options </h4>
 <table class="ForrestTable" cellspacing="1" cellpadding="4">
             
@@ -1037,7 +1037,7 @@ document.write("Last Published: " + document.lastModified);
 </tr>
           
 </table>
-<a name="N104B9"></a><a name="gridservice-mapred+options"></a>
+<a name="N104BA"></a><a name="gridservice-mapred+options"></a>
 <h4> gridservice-mapred options </h4>
 <table class="ForrestTable" cellspacing="1" cellpadding="4">
             

+ 46 - 46
docs/mapred_tutorial.html

@@ -280,7 +280,7 @@ document.write("Last Published: " + document.lastModified);
 <a href="#Example%3A+WordCount+v2.0">Example: WordCount v2.0</a>
 <ul class="minitoc">
 <li>
-<a href="#Source+Code-N10BBD">Source Code</a>
+<a href="#Source+Code-N10BBE">Source Code</a>
 </li>
 <li>
 <a href="#Sample+Runs">Sample Runs</a>
@@ -294,7 +294,7 @@ document.write("Last Published: " + document.lastModified);
 </div>
   
     
-<a name="N1000C"></a><a name="Purpose"></a>
+<a name="N1000D"></a><a name="Purpose"></a>
 <h2 class="h3">Purpose</h2>
 <div class="section">
 <p>This document comprehensively describes all user-facing facets of the 
@@ -303,7 +303,7 @@ document.write("Last Published: " + document.lastModified);
 </div>
     
     
-<a name="N10016"></a><a name="Pre-requisites"></a>
+<a name="N10017"></a><a name="Pre-requisites"></a>
 <h2 class="h3">Pre-requisites</h2>
 <div class="section">
 <p>Ensure that Hadoop is installed, configured and is running. More
@@ -323,7 +323,7 @@ document.write("Last Published: " + document.lastModified);
 </div>
     
     
-<a name="N10031"></a><a name="Overview"></a>
+<a name="N10032"></a><a name="Overview"></a>
 <h2 class="h3">Overview</h2>
 <div class="section">
 <p>Hadoop Map-Reduce is a software framework for easily writing 
@@ -381,7 +381,7 @@ document.write("Last Published: " + document.lastModified);
 </div>
     
     
-<a name="N1008A"></a><a name="Inputs+and+Outputs"></a>
+<a name="N1008B"></a><a name="Inputs+and+Outputs"></a>
 <h2 class="h3">Inputs and Outputs</h2>
 <div class="section">
 <p>The Map-Reduce framework operates exclusively on 
@@ -415,7 +415,7 @@ document.write("Last Published: " + document.lastModified);
 </div>
 
     
-<a name="N100CC"></a><a name="Example%3A+WordCount+v1.0"></a>
+<a name="N100CD"></a><a name="Example%3A+WordCount+v1.0"></a>
 <h2 class="h3">Example: WordCount v1.0</h2>
 <div class="section">
 <p>Before we jump into the details, let's walk through an example Map-Reduce 
@@ -428,7 +428,7 @@ document.write("Last Published: " + document.lastModified);
       <a href="quickstart.html#SingleNodeSetup">pseudo-distributed</a> or
       <a href="quickstart.html#Fully-Distributed+Operation">fully-distributed</a> 
       Hadoop installation.</p>
-<a name="N100E9"></a><a name="Source+Code"></a>
+<a name="N100EA"></a><a name="Source+Code"></a>
 <h3 class="h4">Source Code</h3>
 <table class="ForrestTable" cellspacing="1" cellpadding="4">
           
@@ -991,7 +991,7 @@ document.write("Last Published: " + document.lastModified);
 </tr>
         
 </table>
-<a name="N1046B"></a><a name="Usage"></a>
+<a name="N1046C"></a><a name="Usage"></a>
 <h3 class="h4">Usage</h3>
 <p>Assuming <span class="codefrag">HADOOP_HOME</span> is the root of the installation and 
         <span class="codefrag">HADOOP_VERSION</span> is the Hadoop version installed, compile 
@@ -1086,7 +1086,7 @@ document.write("Last Published: " + document.lastModified);
 <br>
         
 </p>
-<a name="N104EB"></a><a name="Walk-through"></a>
+<a name="N104EC"></a><a name="Walk-through"></a>
 <h3 class="h4">Walk-through</h3>
 <p>The <span class="codefrag">WordCount</span> application is quite straight-forward.</p>
 <p>The <span class="codefrag">Mapper</span> implementation (lines 14-26), via the 
@@ -1196,7 +1196,7 @@ document.write("Last Published: " + document.lastModified);
 </div>
     
     
-<a name="N105A2"></a><a name="Map-Reduce+-+User+Interfaces"></a>
+<a name="N105A3"></a><a name="Map-Reduce+-+User+Interfaces"></a>
 <h2 class="h3">Map-Reduce - User Interfaces</h2>
 <div class="section">
 <p>This section provides a reasonable amount of detail on every user-facing 
@@ -1215,12 +1215,12 @@ document.write("Last Published: " + document.lastModified);
 <p>Finally, we will wrap up by discussing some useful features of the
       framework such as the <span class="codefrag">DistributedCache</span>, 
       <span class="codefrag">IsolationRunner</span> etc.</p>
-<a name="N105DB"></a><a name="Payload"></a>
+<a name="N105DC"></a><a name="Payload"></a>
 <h3 class="h4">Payload</h3>
 <p>Applications typically implement the <span class="codefrag">Mapper</span> and 
         <span class="codefrag">Reducer</span> interfaces to provide the <span class="codefrag">map</span> and 
         <span class="codefrag">reduce</span> methods. These form the core of the job.</p>
-<a name="N105F0"></a><a name="Mapper"></a>
+<a name="N105F1"></a><a name="Mapper"></a>
 <h4>Mapper</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/Mapper.html">
@@ -1276,7 +1276,7 @@ document.write("Last Published: " + document.lastModified);
           <a href="api/org/apache/hadoop/io/compress/CompressionCodec.html">
           CompressionCodec</a> to be used via the <span class="codefrag">JobConf</span>.
           </p>
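
For concreteness, here is the tutorial's WordCount map implementation, condensed from the source listing earlier:

    // imports (as in the tutorial listing): java.io.IOException,
    // java.util.StringTokenizer, org.apache.hadoop.io.*, org.apache.hadoop.mapred.*
    public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());
          output.collect(word, one);      // emit <word, 1>
        }
      }
    }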
-<a name="N1066A"></a><a name="How+Many+Maps%3F"></a>
+<a name="N1066B"></a><a name="How+Many+Maps%3F"></a>
 <h5>How Many Maps?</h5>
 <p>The number of maps is usually driven by the total size of the 
             inputs, that is, the total number of blocks of the input files.</p>
@@ -1289,7 +1289,7 @@ document.write("Last Published: " + document.lastModified);
             <a href="api/org/apache/hadoop/mapred/JobConf.html#setNumMapTasks(int)">
             setNumMapTasks(int)</a> (which only provides a hint to the framework) 
             is used to set it even higher.</p>
-<a name="N10682"></a><a name="Reducer"></a>
+<a name="N10683"></a><a name="Reducer"></a>
 <h4>Reducer</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/Reducer.html">
@@ -1312,18 +1312,18 @@ document.write("Last Published: " + document.lastModified);
 <p>
 <span class="codefrag">Reducer</span> has 3 primary phases: shuffle, sort and reduce.
           </p>
-<a name="N106B2"></a><a name="Shuffle"></a>
+<a name="N106B3"></a><a name="Shuffle"></a>
 <h5>Shuffle</h5>
 <p>Input to the <span class="codefrag">Reducer</span> is the sorted output of the
             mappers. In this phase the framework fetches the relevant partition 
             of the output of all the mappers, via HTTP.</p>
-<a name="N106BF"></a><a name="Sort"></a>
+<a name="N106C0"></a><a name="Sort"></a>
 <h5>Sort</h5>
 <p>The framework groups <span class="codefrag">Reducer</span> inputs by keys (since 
             different mappers may have output the same key) in this stage.</p>
 <p>The shuffle and sort phases occur simultaneously; while 
             map-outputs are being fetched they are merged.</p>
-<a name="N106CE"></a><a name="Secondary+Sort"></a>
+<a name="N106CF"></a><a name="Secondary+Sort"></a>
 <h5>Secondary Sort</h5>
 <p>If equivalence rules for grouping the intermediate keys are 
               required to be different from those for grouping keys before 
@@ -1334,7 +1334,7 @@ document.write("Last Published: " + document.lastModified);
               JobConf.setOutputKeyComparatorClass(Class)</a> can be used to 
               control how intermediate keys are grouped, these can be used in 
               conjunction to simulate <em>secondary sort on values</em>.</p>
-<a name="N106E7"></a><a name="Reduce"></a>
+<a name="N106E8"></a><a name="Reduce"></a>
 <h5>Reduce</h5>
 <p>In this phase the 
             <a href="api/org/apache/hadoop/mapred/Reducer.html#reduce(K2, java.util.Iterator, org.apache.hadoop.mapred.OutputCollector, org.apache.hadoop.mapred.Reporter)">
@@ -1350,7 +1350,7 @@ document.write("Last Published: " + document.lastModified);
             progress, set application-level status messages and update 
             <span class="codefrag">Counters</span>, or just indicate that they are alive.</p>
 <p>The output of the <span class="codefrag">Reducer</span> is <em>not sorted</em>.</p>
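
Again condensed from the WordCount listing, the reduce method simply sums the per-key values:

    // imports as above, plus java.util.Iterator
    public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();     // add up the 1s emitted by the maps
        }
        output.collect(key, new IntWritable(sum));
      }
    }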
-<a name="N10715"></a><a name="How+Many+Reduces%3F"></a>
+<a name="N10716"></a><a name="How+Many+Reduces%3F"></a>
 <h5>How Many Reduces?</h5>
 <p>The right number of reduces seems to be <span class="codefrag">0.95</span> or 
             <span class="codefrag">1.75</span> multiplied by (&lt;<em>no. of nodes</em>&gt; * 
@@ -1365,7 +1365,7 @@ document.write("Last Published: " + document.lastModified);
 <p>The scaling factors above are slightly less than whole numbers to 
             reserve a few reduce slots in the framework for speculative-tasks and
             failed tasks.</p>
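
Worked through, given a JobConf conf: a hypothetical 10-node cluster with 2 task slots per node gets 0.95 * 10 * 2 = 19 reduces:

    int nodes = 10;   // hypothetical cluster size, known to the application
    int maxTasks = conf.getInt("mapred.tasktracker.tasks.maximum", 2);
    conf.setNumReduceTasks((int) (0.95 * nodes * maxTasks));   // 19 reduces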
-<a name="N1073A"></a><a name="Reducer+NONE"></a>
+<a name="N1073B"></a><a name="Reducer+NONE"></a>
 <h5>Reducer NONE</h5>
 <p>It is legal to set the number of reduce-tasks to <em>zero</em> if 
             no reduction is desired.</p>
@@ -1375,7 +1375,7 @@ document.write("Last Published: " + document.lastModified);
             setOutputPath(Path)</a>. The framework does not sort the 
             map-outputs before writing them out to the <span class="codefrag">FileSystem</span>.
             </p>
-<a name="N10755"></a><a name="Partitioner"></a>
+<a name="N10756"></a><a name="Partitioner"></a>
 <h4>Partitioner</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/Partitioner.html">
@@ -1389,7 +1389,7 @@ document.write("Last Published: " + document.lastModified);
 <p>
 <a href="api/org/apache/hadoop/mapred/lib/HashPartitioner.html">
           HashPartitioner</a> is the default <span class="codefrag">Partitioner</span>.</p>
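 <p>A sketch of a custom <span class="codefrag">Partitioner</span> that routes
           keys by their first character; whether this is useful depends entirely
           on the application's key space:</p>
 <pre class="code">
 public class FirstCharPartitioner implements Partitioner&lt;Text, IntWritable&gt; {
   public void configure(JobConf job) {}
   // Send all keys with the same leading character to the same reduce.
   public int getPartition(Text key, IntWritable value, int numPartitions) {
     return (key.charAt(0) &amp; Integer.MAX_VALUE) % numPartitions;
   }
 }
 // Registered via: conf.setPartitionerClass(FirstCharPartitioner.class);
 </pre>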
-<a name="N10774"></a><a name="Reporter"></a>
+<a name="N10775"></a><a name="Reporter"></a>
 <h4>Reporter</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/Reporter.html">
@@ -1408,7 +1408,7 @@ document.write("Last Published: " + document.lastModified);
           </p>
 <p>Applications can also update <span class="codefrag">Counters</span> using the 
           <span class="codefrag">Reporter</span>.</p>
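 <p>A sketch of a long-running <span class="codefrag">map</span> fragment that
           reports liveness and status (the surrounding class is omitted):</p>
 <pre class="code">
 public void map(LongWritable key, Text value,
                 OutputCollector&lt;Text, IntWritable&gt; output, Reporter reporter)
     throws IOException {
   // ... potentially slow, record-at-a-time processing ...
   reporter.setStatus("processing offset " + key);  // application-level status
   reporter.progress();  // tell the framework this task is still alive
 }
 </pre>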
-<a name="N1079E"></a><a name="OutputCollector"></a>
+<a name="N1079F"></a><a name="OutputCollector"></a>
 <h4>OutputCollector</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/OutputCollector.html">
@@ -1419,7 +1419,7 @@ document.write("Last Published: " + document.lastModified);
 <p>Hadoop Map-Reduce comes bundled with a 
         <a href="api/org/apache/hadoop/mapred/lib/package-summary.html">
         library</a> of generally useful mappers, reducers, and partitioners.</p>
-<a name="N107B9"></a><a name="Job+Configuration"></a>
+<a name="N107BA"></a><a name="Job+Configuration"></a>
 <h3 class="h4">Job Configuration</h3>
 <p>
 <a href="api/org/apache/hadoop/mapred/JobConf.html">
@@ -1474,7 +1474,7 @@ document.write("Last Published: " + document.lastModified);
         <a href="api/org/apache/hadoop/conf/Configuration.html#set(java.lang.String, java.lang.String)">set(String, String)</a>/<a href="api/org/apache/hadoop/conf/Configuration.html#get(java.lang.String, java.lang.String)">get(String, String)</a>
         to set/get arbitrary parameters needed by applications. However, use the 
         <span class="codefrag">DistributedCache</span> for large amounts of (read-only) data.</p>
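 <p>A sketch of passing a small application parameter through the
         configuration; the parameter name and paths are hypothetical:</p>
 <pre class="code">
 JobConf conf = new JobConf(MyJob.class);
 conf.set("myapp.dictionary.path", "/user/foo/dict.txt");
 // Later, e.g. in a Mapper's configure(JobConf) method:
 String dict = conf.get("myapp.dictionary.path", "/default/dict.txt");
 </pre>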
-<a name="N10843"></a><a name="Task+Execution+%26+Environment"></a>
+<a name="N10844"></a><a name="Task+Execution+%26+Environment"></a>
 <h3 class="h4">Task Execution &amp; Environment</h3>
 <p>The <span class="codefrag">TaskTracker</span> executes the <span class="codefrag">Mapper</span>/ 
         <span class="codefrag">Reducer</span>  <em>task</em> as a child process in a separate jvm.
@@ -1534,7 +1534,7 @@ document.write("Last Published: " + document.lastModified);
         loaded via <a href="http://java.sun.com/j2se/1.5.0/docs/api/java/lang/System.html#loadLibrary(java.lang.String)">
         System.loadLibrary</a> or <a href="http://java.sun.com/j2se/1.5.0/docs/api/java/lang/System.html#load(java.lang.String)">
         System.load</a>.</p>
-<a name="N108B8"></a><a name="Job+Submission+and+Monitoring"></a>
+<a name="N108B9"></a><a name="Job+Submission+and+Monitoring"></a>
 <h3 class="h4">Job Submission and Monitoring</h3>
 <p>
 <a href="api/org/apache/hadoop/mapred/JobClient.html">
@@ -1570,7 +1570,7 @@ document.write("Last Published: " + document.lastModified);
 <p>Normally the user creates the application, describes various facets 
         of the job via <span class="codefrag">JobConf</span>, and then uses the 
         <span class="codefrag">JobClient</span> to submit the job and monitor its progress.</p>
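 <p>A minimal sketch of that flow, with hypothetical input and output
         paths:</p>
 <pre class="code">
 JobConf conf = new JobConf(MyJob.class);
 conf.setJobName("my-job");
 conf.setInputPath(new Path("/user/foo/input"));
 conf.setOutputPath(new Path("/user/foo/output"));
 JobClient.runJob(conf);  // submit, poll progress, and wait for completion
 </pre>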
-<a name="N108F6"></a><a name="Job+Control"></a>
+<a name="N108F7"></a><a name="Job+Control"></a>
 <h4>Job Control</h4>
 <p>Users may need to chain map-reduce jobs to accomplish complex
           tasks which cannot be done via a single map-reduce job. This is fairly
@@ -1606,7 +1606,7 @@ document.write("Last Published: " + document.lastModified);
             </li>
           
 </ul>
-<a name="N10920"></a><a name="Job+Input"></a>
+<a name="N10921"></a><a name="Job+Input"></a>
 <h3 class="h4">Job Input</h3>
 <p>
 <a href="api/org/apache/hadoop/mapred/InputFormat.html">
@@ -1654,7 +1654,7 @@ document.write("Last Published: " + document.lastModified);
         appropriate <span class="codefrag">CompressionCodec</span>. However, it must be noted that
         compressed files with the above extensions cannot be <em>split</em> and 
         each compressed file is processed in its entirety by a single mapper.</p>
-<a name="N1098A"></a><a name="InputSplit"></a>
+<a name="N1098B"></a><a name="InputSplit"></a>
 <h4>InputSplit</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/InputSplit.html">
@@ -1668,7 +1668,7 @@ document.write("Last Published: " + document.lastModified);
           FileSplit</a> is the default <span class="codefrag">InputSplit</span>. It sets 
           <span class="codefrag">map.input.file</span> to the path of the input file for the
           logical split.</p>
-<a name="N109AF"></a><a name="RecordReader"></a>
+<a name="N109B0"></a><a name="RecordReader"></a>
 <h4>RecordReader</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/RecordReader.html">
@@ -1680,7 +1680,7 @@ document.write("Last Published: " + document.lastModified);
           for processing. <span class="codefrag">RecordReader</span> thus assumes the 
           responsibility of processing record boundaries and presents the tasks 
           with keys and values.</p>
-<a name="N109D2"></a><a name="Job+Output"></a>
+<a name="N109D3"></a><a name="Job+Output"></a>
 <h3 class="h4">Job Output</h3>
 <p>
 <a href="api/org/apache/hadoop/mapred/OutputFormat.html">
@@ -1705,7 +1705,7 @@ document.write("Last Published: " + document.lastModified);
 <p>
 <span class="codefrag">TextOutputFormat</span> is the default 
         <span class="codefrag">OutputFormat</span>.</p>
-<a name="N109FB"></a><a name="Task+Side-Effect+Files"></a>
+<a name="N109FC"></a><a name="Task+Side-Effect+Files"></a>
 <h4>Task Side-Effect Files</h4>
 <p>In some applications, component tasks need to create and/or write to
           side-files, which differ from the actual job-output files.</p>
@@ -1731,7 +1731,7 @@ document.write("Last Published: " + document.lastModified);
           JobConf.getOutputPath()</a>, and the framework will promote them 
           similarly for successful task-attempts, thus eliminating the need to 
           pick unique paths per task-attempt.</p>
-<a name="N10A30"></a><a name="RecordWriter"></a>
+<a name="N10A31"></a><a name="RecordWriter"></a>
 <h4>RecordWriter</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/RecordWriter.html">
@@ -1739,9 +1739,9 @@ document.write("Last Published: " + document.lastModified);
           pairs to an output file.</p>
 <p>RecordWriter implementations write the job outputs to the 
           <span class="codefrag">FileSystem</span>.</p>
-<a name="N10A47"></a><a name="Other+Useful+Features"></a>
+<a name="N10A48"></a><a name="Other+Useful+Features"></a>
 <h3 class="h4">Other Useful Features</h3>
-<a name="N10A4D"></a><a name="Counters"></a>
+<a name="N10A4E"></a><a name="Counters"></a>
 <h4>Counters</h4>
 <p>
 <span class="codefrag">Counters</span> represent global counters, defined either by 
@@ -1755,7 +1755,7 @@ document.write("Last Published: " + document.lastModified);
           Reporter.incrCounter(Enum, long)</a> in the <span class="codefrag">map</span> and/or 
           <span class="codefrag">reduce</span> methods. These counters are then globally 
           aggregated by the framework.</p>
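 <p>A sketch of an application-defined counter; the
           <span class="codefrag">Enum</span> and the mapper are hypothetical:</p>
 <pre class="code">
 public class MyMapper extends MapReduceBase
     implements Mapper&lt;LongWritable, Text, Text, IntWritable&gt; {
   static enum Records { MALFORMED, PROCESSED }  // application-defined Enum
   public void map(LongWritable key, Text value,
                   OutputCollector&lt;Text, IntWritable&gt; output, Reporter reporter)
       throws IOException {
     if (value.getLength() == 0) {
       reporter.incrCounter(Records.MALFORMED, 1);  // aggregated globally
       return;
     }
     reporter.incrCounter(Records.PROCESSED, 1);
     // ... output.collect(...) as usual ...
   }
 }
 </pre>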
-<a name="N10A78"></a><a name="DistributedCache"></a>
+<a name="N10A79"></a><a name="DistributedCache"></a>
 <h4>DistributedCache</h4>
 <p>
 <a href="api/org/apache/hadoop/filecache/DistributedCache.html">
@@ -1788,7 +1788,7 @@ document.write("Last Published: " + document.lastModified);
           <a href="api/org/apache/hadoop/filecache/DistributedCache.html#createSymlink(org.apache.hadoop.conf.Configuration)">
           DistributedCache.createSymlink(Configuration)</a> api. Files 
           have <em>execution permissions</em> set.</p>
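 <p>A sketch of caching a lookup file; the file is hypothetical and must
           already exist in HDFS (the URI fragment names the symlink):</p>
 <pre class="code">
 JobConf conf = new JobConf(MyJob.class);
 DistributedCache.addCacheFile(new URI("/user/foo/lookup.dat#lookup.dat"), conf);
 DistributedCache.createSymlink(conf);  // symlink cached files into the task cwd
 </pre>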
-<a name="N10AB6"></a><a name="Tool"></a>
+<a name="N10AB7"></a><a name="Tool"></a>
 <h4>Tool</h4>
 <p>The <a href="api/org/apache/hadoop/util/Tool.html">Tool</a> 
           interface supports the handling of generic Hadoop command-line options.
@@ -1828,7 +1828,7 @@ document.write("Last Published: " + document.lastModified);
             </span>
           
 </p>
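 <p>A minimal sketch of a <span class="codefrag">Tool</span> implementation;
           it assumes the generic options have already been stripped by the
           framework's tool-runner utility before <span class="codefrag">run</span>
           is invoked, so only application-specific arguments remain:</p>
 <pre class="code">
 public class MyTool extends Configured implements Tool {
   public int run(String[] args) throws Exception {
     JobConf conf = new JobConf(getConf(), MyTool.class);
     conf.setInputPath(new Path(args[0]));
     conf.setOutputPath(new Path(args[1]));
     JobClient.runJob(conf);
     return 0;
   }
 }
 </pre>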
-<a name="N10AE8"></a><a name="IsolationRunner"></a>
+<a name="N10AE9"></a><a name="IsolationRunner"></a>
 <h4>IsolationRunner</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/IsolationRunner.html">
@@ -1852,13 +1852,13 @@ document.write("Last Published: " + document.lastModified);
 <p>
 <span class="codefrag">IsolationRunner</span> will run the failed task in a single 
           jvm, which can be in the debugger, over precisely the same input.</p>
-<a name="N10B1B"></a><a name="JobControl"></a>
+<a name="N10B1C"></a><a name="JobControl"></a>
 <h4>JobControl</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/jobcontrol/package-summary.html">
           JobControl</a> is a utility which encapsulates a set of Map-Reduce jobs
           and their dependencies.</p>
-<a name="N10B28"></a><a name="Data+Compression"></a>
+<a name="N10B29"></a><a name="Data+Compression"></a>
 <h4>Data Compression</h4>
 <p>Hadoop Map-Reduce provides facilities for the application-writer to
           specify compression for both intermediate map-outputs and the
@@ -1872,7 +1872,7 @@ document.write("Last Published: " + document.lastModified);
           codecs for reasons of both performance (zlib) and non-availability of
           Java libraries (lzo). More details on their usage and availability are
           available <a href="native_libraries.html">here</a>.</p>
-<a name="N10B48"></a><a name="Intermediate+Outputs"></a>
+<a name="N10B49"></a><a name="Intermediate+Outputs"></a>
 <h5>Intermediate Outputs</h5>
 <p>Applications can control compression of intermediate map-outputs
             via the 
@@ -1893,7 +1893,7 @@ document.write("Last Published: " + document.lastModified);
             <a href="api/org/apache/hadoop/mapred/JobConf.html#setMapOutputCompressionType(org.apache.hadoop.io.SequenceFile.CompressionType)">
             JobConf.setMapOutputCompressionType(SequenceFile.CompressionType)</a> 
             api.</p>
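 <p>A sketch putting the two calls together:</p>
 <pre class="code">
 JobConf conf = new JobConf(MyJob.class);
 conf.setCompressMapOutput(true);  // compress the intermediate map-outputs
 // BLOCK compression groups many records per compressed SequenceFile block.
 conf.setMapOutputCompressionType(SequenceFile.CompressionType.BLOCK);
 </pre>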
-<a name="N10B74"></a><a name="Job+Outputs"></a>
+<a name="N10B75"></a><a name="Job+Outputs"></a>
 <h5>Job Outputs</h5>
 <p>Applications can control compression of job-outputs via the
             <a href="api/org/apache/hadoop/mapred/OutputFormatBase.html#setCompressOutput(org.apache.hadoop.mapred.JobConf,%20boolean)">
@@ -1913,7 +1913,7 @@ document.write("Last Published: " + document.lastModified);
 </div>
 
     
-<a name="N10BA3"></a><a name="Example%3A+WordCount+v2.0"></a>
+<a name="N10BA4"></a><a name="Example%3A+WordCount+v2.0"></a>
 <h2 class="h3">Example: WordCount v2.0</h2>
 <div class="section">
 <p>Here is a more complete <span class="codefrag">WordCount</span> which uses many of the
@@ -1923,7 +1923,7 @@ document.write("Last Published: " + document.lastModified);
       <a href="quickstart.html#SingleNodeSetup">pseudo-distributed</a> or
       <a href="quickstart.html#Fully-Distributed+Operation">fully-distributed</a> 
       Hadoop installation.</p>
-<a name="N10BBD"></a><a name="Source+Code-N10BBD"></a>
+<a name="N10BBE"></a><a name="Source+Code-N10BBE"></a>
 <h3 class="h4">Source Code</h3>
 <table class="ForrestTable" cellspacing="1" cellpadding="4">
           
@@ -3133,7 +3133,7 @@ document.write("Last Published: " + document.lastModified);
 </tr>
         
 </table>
-<a name="N1131F"></a><a name="Sample+Runs"></a>
+<a name="N11320"></a><a name="Sample+Runs"></a>
 <h3 class="h4">Sample Runs</h3>
 <p>Sample text-files as input:</p>
 <p>
@@ -3301,7 +3301,7 @@ document.write("Last Published: " + document.lastModified);
 <br>
         
 </p>
-<a name="N113F3"></a><a name="Highlights"></a>
+<a name="N113F4"></a><a name="Highlights"></a>
 <h3 class="h4">Highlights</h3>
 <p>The second version of <span class="codefrag">WordCount</span> improves upon the 
         previous one by using some features offered by the Map-Reduce framework:

+ 6 - 6
docs/native_libraries.html

@@ -190,7 +190,7 @@ document.write("Last Published: " + document.lastModified);
 </div>
   
     
-<a name="N1000C"></a><a name="Purpose"></a>
+<a name="N1000D"></a><a name="Purpose"></a>
 <h2 class="h3">Purpose</h2>
 <div class="section">
 <p>Hadoop has native implementations of certain components for reasons of 
@@ -201,7 +201,7 @@ document.write("Last Published: " + document.lastModified);
 </div>
     
     
-<a name="N10019"></a><a name="Components"></a>
+<a name="N1001A"></a><a name="Components"></a>
 <h2 class="h3">Components</h2>
 <div class="section">
 <p>Hadoop currently has the following 
@@ -227,7 +227,7 @@ document.write("Last Published: " + document.lastModified);
 </div>
 
     
-<a name="N1003C"></a><a name="Usage"></a>
+<a name="N1003D"></a><a name="Usage"></a>
 <h2 class="h3">Usage</h2>
 <div class="section">
 <p>It is fairly simple to use the native hadoop libraries:</p>
@@ -281,7 +281,7 @@ document.write("Last Published: " + document.lastModified);
 </div>
     
     
-<a name="N10086"></a><a name="Supported+Platforms"></a>
+<a name="N10087"></a><a name="Supported+Platforms"></a>
 <h2 class="h3">Supported Platforms</h2>
 <div class="section">
 <p>The Hadoop native library is supported on *nix platforms only.
@@ -311,7 +311,7 @@ document.write("Last Published: " + document.lastModified);
 </div>
     
     
-<a name="N100B6"></a><a name="Building+Native+Hadoop+Libraries"></a>
+<a name="N100B7"></a><a name="Building+Native+Hadoop+Libraries"></a>
 <h2 class="h3">Building Native Hadoop Libraries</h2>
 <div class="section">
 <p>The Hadoop native library is written in 
@@ -360,7 +360,7 @@ document.write("Last Published: " + document.lastModified);
 <p>where &lt;platform&gt; is a combination of the system properties: 
       <span class="codefrag">${os.name}-${os.arch}-${sun.arch.data.model}</span>; e.g. 
       Linux-i386-32.</p>
-<a name="N10109"></a><a name="Notes"></a>
+<a name="N1010A"></a><a name="Notes"></a>
 <h3 class="h4">Notes</h3>
 <ul>
           

+ 13 - 13
docs/quickstart.html

@@ -215,7 +215,7 @@ document.write("Last Published: " + document.lastModified);
 </div>
   
     
-<a name="N1000C"></a><a name="Purpose"></a>
+<a name="N1000D"></a><a name="Purpose"></a>
 <h2 class="h3">Purpose</h2>
 <div class="section">
 <p>The purpose of this document is to help users get a single-node Hadoop 
@@ -227,10 +227,10 @@ document.write("Last Published: " + document.lastModified);
 </div>
     
     
-<a name="N1001E"></a><a name="PreReqs"></a>
+<a name="N1001F"></a><a name="PreReqs"></a>
 <h2 class="h3">Pre-requisites</h2>
 <div class="section">
-<a name="N10024"></a><a name="Supported+Platforms"></a>
+<a name="N10025"></a><a name="Supported+Platforms"></a>
 <h3 class="h4">Supported Platforms</h3>
 <ul>
           
@@ -245,7 +245,7 @@ document.write("Last Published: " + document.lastModified);
           </li>
         
 </ul>
-<a name="N1003A"></a><a name="Required+Software"></a>
+<a name="N1003B"></a><a name="Required+Software"></a>
 <h3 class="h4">Required Software</h3>
 <ol>
           
@@ -262,7 +262,7 @@ document.write("Last Published: " + document.lastModified);
           </li>
         
 </ol>
-<a name="N10055"></a><a name="Additional+requirements+for+Windows"></a>
+<a name="N10056"></a><a name="Additional+requirements+for+Windows"></a>
 <h4>Additional requirements for Windows</h4>
 <ol>
             
@@ -273,7 +273,7 @@ document.write("Last Published: " + document.lastModified);
             </li>
           
 </ol>
-<a name="N10067"></a><a name="Installing+Software"></a>
+<a name="N10068"></a><a name="Installing+Software"></a>
 <h3 class="h4">Installing Software</h3>
 <p>If your cluster doesn't have the requisite software you will need to
         install it.</p>
@@ -296,7 +296,7 @@ document.write("Last Published: " + document.lastModified);
 </div>
     
     
-<a name="N1008B"></a><a name="Download"></a>
+<a name="N1008C"></a><a name="Download"></a>
 <h2 class="h3">Download</h2>
 <div class="section">
 <p>
@@ -318,7 +318,7 @@ document.write("Last Published: " + document.lastModified);
 </div>
     
     
-<a name="N100AE"></a><a name="Standalone+Operation"></a>
+<a name="N100AF"></a><a name="Standalone+Operation"></a>
 <h2 class="h3">Standalone Operation</h2>
 <div class="section">
 <p>By default, Hadoop is configured to run things in a non-distributed 
@@ -346,12 +346,12 @@ document.write("Last Published: " + document.lastModified);
 </div>
     
     
-<a name="N100D2"></a><a name="SingleNodeSetup"></a>
+<a name="N100D3"></a><a name="SingleNodeSetup"></a>
 <h2 class="h3">Pseudo-Distributed Operation</h2>
 <div class="section">
 <p>Hadoop can also be run on a single node in a pseudo-distributed mode 
 	  where each Hadoop daemon runs in a separate Java process.</p>
-<a name="N100DB"></a><a name="Configuration"></a>
+<a name="N100DC"></a><a name="Configuration"></a>
 <h3 class="h4">Configuration</h3>
 <p>Use the following <span class="codefrag">conf/hadoop-site.xml</span>:</p>
 <table class="ForrestTable" cellspacing="1" cellpadding="4">
@@ -417,7 +417,7 @@ document.write("Last Published: " + document.lastModified);
 </tr>
         
 </table>
-<a name="N1013F"></a><a name="Setup+passphraseless"></a>
+<a name="N10140"></a><a name="Setup+passphraseless"></a>
 <h3 class="h4">Setup passphraseless ssh</h3>
 <p>
           Now check that you can ssh to the localhost without a passphrase:<br>
@@ -435,7 +435,7 @@ document.write("Last Published: " + document.lastModified);
 <span class="codefrag">$ cat ~/.ssh/id_dsa.pub &gt;&gt; ~/.ssh/authorized_keys</span>
 		
 </p>
-<a name="N1015C"></a><a name="Execution"></a>
+<a name="N1015D"></a><a name="Execution"></a>
 <h3 class="h4">Execution</h3>
 <p>
           Format a new distributed-filesystem:<br>
@@ -512,7 +512,7 @@ document.write("Last Published: " + document.lastModified);
 </div>
     
     
-<a name="N101C9"></a><a name="Fully-Distributed+Operation"></a>
+<a name="N101CA"></a><a name="Fully-Distributed+Operation"></a>
 <h2 class="h3">Fully-Distributed Operation</h2>
 <div class="section">
 <p>Information on setting up fully-distributed non-trivial clusters

+ 25 - 25
docs/streaming.html

@@ -253,7 +253,7 @@ document.write("Last Published: " + document.lastModified);
 </ul>
 </div>
 
-<a name="N10018"></a><a name="Hadoop+Streaming"></a>
+<a name="N10019"></a><a name="Hadoop+Streaming"></a>
 <h2 class="h3">Hadoop Streaming</h2>
 <div class="section">
 <p>
@@ -269,7 +269,7 @@ $HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
 </div>
 
 
-<a name="N10026"></a><a name="How+Does+Streaming+Work"></a>
+<a name="N10027"></a><a name="How+Does+Streaming+Work"></a>
 <h2 class="h3">How Does Streaming Work </h2>
 <div class="section">
 <p>
@@ -298,7 +298,7 @@ $HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
 </div>
 
 
-<a name="N1004E"></a><a name="Package+Files+With+Job+Submissions"></a>
+<a name="N1004F"></a><a name="Package+Files+With+Job+Submissions"></a>
 <h2 class="h3">Package Files With Job Submissions</h2>
 <div class="section">
 <p>
@@ -330,10 +330,10 @@ $HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
 </div>
 
 
-<a name="N10066"></a><a name="Streaming+Options+and+Usage"></a>
+<a name="N10067"></a><a name="Streaming+Options+and+Usage"></a>
 <h2 class="h3">Streaming Options and Usage </h2>
 <div class="section">
-<a name="N1006C"></a><a name="Mapper-Only+Jobs"></a>
+<a name="N1006D"></a><a name="Mapper-Only+Jobs"></a>
 <h3 class="h4">Mapper-Only Jobs </h3>
 <p>
 Often, you may want to process input data using a map function only. To do this, simply set mapred.reduce.tasks to zero. The map/reduce framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job.
@@ -341,7 +341,7 @@ Often, you may want to process input data using a map function only. To do this,
 <p>
 To be backward compatible, Hadoop Streaming also supports the "-reduce NONE" option, which is equivalent to "-jobconf mapred.reduce.tasks=0".
 </p>
-<a name="N10078"></a><a name="Specifying+Other+Plugins+for+Jobs"></a>
+<a name="N10079"></a><a name="Specifying+Other+Plugins+for+Jobs"></a>
 <h3 class="h4">Specifying Other Plugins for Jobs </h3>
 <p>
 Just as with a normal map/reduce job, you can specify other plugins for a streaming job:
@@ -358,7 +358,7 @@ The class you supply for the input format should return key/value pairs of Text
 <p>
 The class you supply for the output format is expected to take key/value pairs of Text class. If you do not specify an output format class, the TextOutputFormat is used as the default.
 </p>
-<a name="N1008B"></a><a name="Large+files+and+archives+in+Hadoop+Streaming"></a>
+<a name="N1008C"></a><a name="Large+files+and+archives+in+Hadoop+Streaming"></a>
 <h3 class="h4">Large files and archives in Hadoop Streaming </h3>
 <p>
 The -cacheFile and -cacheArchive options allow you to make files and archives available to the tasks. The argument is a URI to the file or archive that you have already uploaded to HDFS. These files and archives are cached across jobs. You can retrieve the host and fs_port values from the fs.default.name config variable.
@@ -427,7 +427,7 @@ This is just the cache string
 This is just the second cache string
 
 </pre>
-<a name="N100B4"></a><a name="Specifying+Additional+Configuration+Variables+for+Jobs"></a>
+<a name="N100B5"></a><a name="Specifying+Additional+Configuration+Variables+for+Jobs"></a>
 <h3 class="h4">Specifying Additional Configuration Variables for Jobs </h3>
 <p>
 You can specify additional configuration variables by using "-jobconf  &lt;n&gt;=&lt;v&gt;". For example: 
@@ -446,7 +446,7 @@ The -jobconf mapred.reduce.tasks=2 in the above example specifies to use two red
 <p>
 For more details on the jobconf parameters see: <a href="http://wiki.apache.org/hadoop/JobConfFile">http://wiki.apache.org/hadoop/JobConfFile</a>
 </p>
-<a name="N100CB"></a><a name="Other+Supported+Options"></a>
+<a name="N100CC"></a><a name="Other+Supported+Options"></a>
 <h3 class="h4">Other Supported Options </h3>
 <p>
 Other options you may specify for a streaming job are described here:
@@ -528,10 +528,10 @@ To set an environment variable in a streaming command use:
 </div>
 
 
-<a name="N10183"></a><a name="More+usage+examples"></a>
+<a name="N10184"></a><a name="More+usage+examples"></a>
 <h2 class="h3">More usage examples </h2>
 <div class="section">
-<a name="N10189"></a><a name="Customizing+the+Way+to+Split+Lines+into+Key%2FValue+Pairs"></a>
+<a name="N1018A"></a><a name="Customizing+the+Way+to+Split+Lines+into+Key%2FValue+Pairs"></a>
 <h3 class="h4">Customizing the Way to Split Lines into Key/Value Pairs </h3>
 <p>
 As noted earlier, when the map/reduce framework reads a line from the stdout of the mapper, it splits the line into a key/value pair. By default, the prefix of the line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value.
@@ -554,7 +554,7 @@ In the above example, "-jobconf stream.map.output.field.separator=." specifies "
 <p>
 Similarly, you can use "-jobconf stream.reduce.output.field.separator=SEP" and "-jobconf stream.num.reduce.output.fields=NUM" to specify the nth field separator in a line of the reduce outputs as the separator between the key and the value.
 </p>
-<a name="N1019F"></a><a name="A+Useful+Partitioner+Class+%28secondary+sort%2C+the+-partitioner+org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner+option%29"></a>
+<a name="N101A0"></a><a name="A+Useful+Partitioner+Class+%28secondary+sort%2C+the+-partitioner+org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner+option%29"></a>
 <h3 class="h4">A Useful Partitioner Class (secondary sort, the -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner option) </h3>
 <p>
 Hadoop has a library class, org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner, that is useful for many applications. This class allows the map/reduce framework to partition the map outputs based on prefixes of keys, not the whole keys. For example:
@@ -614,7 +614,7 @@ Sorting within each partition for the reducer(all 4 fields used for sorting)</p>
 11.14.2.2
 11.14.2.3
 </pre>
-<a name="N101D5"></a><a name="Working+with+the+Hadoop+Aggregate+Package+%28the+-reduce+aggregate+option%29"></a>
+<a name="N101D6"></a><a name="Working+with+the+Hadoop+Aggregate+Package+%28the+-reduce+aggregate+option%29"></a>
 <h3 class="h4">Working with the Hadoop Aggregate Package (the -reduce aggregate option) </h3>
 <p>
 Hadoop has a library package called "Aggregate" (<a href="https://svn.apache.org/repos/asf/hadoop/core/trunk/src/java/org/apache/hadoop/mapred/lib/aggregate">https://svn.apache.org/repos/asf/hadoop/core/trunk/src/java/org/apache/hadoop/mapred/lib/aggregate</a>).  Aggregate provides a special reducer class and a special combiner class, and a list of simple aggregators that perform aggregations such as "sum", "max", "min" and so on  over a sequence of values. Aggregate allows you to define a mapper plugin class that is expected to generate "aggregatable items" for each input key/value pair of the mappers. The combiner/reducer will aggregate those aggregatable items by invoking the appropriate aggregators.
@@ -655,7 +655,7 @@ def main(argv):
 if __name__ == "__main__":
      main(sys.argv)
 </pre>
-<a name="N101F0"></a><a name="Field+Selection+%28+similar+to+unix+%27cut%27+command%29"></a>
+<a name="N101F1"></a><a name="Field+Selection+%28+similar+to+unix+%27cut%27+command%29"></a>
 <h3 class="h4">Field Selection ( similar to unix 'cut' command) </h3>
 <p>
 Hadoop has a library class, org.apache.hadoop.mapred.lib.FieldSelectionMapReduce, that effectively allows you to process text data like the unix "cut" utility. The map function defined in the class treats each input key/value pair as a list of fields. You can specify the field separator (the default is the tab character). You can select an arbitrary list of fields as the map output key, and an arbitrary list of fields as the map output value. Similarly, the reduce function defined in the class treats each input key/value pair as a list of fields. You can select an arbitrary list of fields as the reduce output key, and an arbitrary list of fields as the reduce output value. For example:
@@ -684,15 +684,15 @@ The option "-jobconf reduce.output.key.value.fields.spec=0-2:0-" specifies key/v
 </div>
 
 
-<a name="N10204"></a><a name="Frequently+Asked+Questions"></a>
+<a name="N10205"></a><a name="Frequently+Asked+Questions"></a>
 <h2 class="h3">Frequently Asked Questions </h2>
 <div class="section">
-<a name="N1020A"></a><a name="How+do+I+use+Hadoop+Streaming+to+run+an+arbitrary+set+of+%28semi-%29independent+tasks%3F"></a>
+<a name="N1020B"></a><a name="How+do+I+use+Hadoop+Streaming+to+run+an+arbitrary+set+of+%28semi-%29independent+tasks%3F"></a>
 <h3 class="h4">How do I use Hadoop Streaming to run an arbitrary set of (semi-)independent tasks? </h3>
 <p>
 Often you do not need the full power of Map Reduce, but only need to run multiple instances of the same program - either on different parts of the data, or on the same data, but with different parameters. You can use Hadoop Streaming to do this.
 </p>
-<a name="N10214"></a><a name="How+do+I+process+files%2C+one+per+map%3F"></a>
+<a name="N10215"></a><a name="How+do+I+process+files%2C+one+per+map%3F"></a>
 <h3 class="h4">How do I process files, one per map? </h3>
 <p>
 As an example, consider the problem of zipping (compressing) a set of files across the hadoop cluster. You can achieve this using either of these methods:
@@ -736,13 +736,13 @@ As an example, consider the problem of zipping (compressing) a set of files acro
 </li>
 
 </ol>
-<a name="N1023F"></a><a name="How+many+reducers+should+I+use%3F"></a>
+<a name="N10240"></a><a name="How+many+reducers+should+I+use%3F"></a>
 <h3 class="h4">How many reducers should I use? </h3>
 <p>
 See the Hadoop Wiki for details: <a href="http://wiki.apache.org/hadoop/HowManyMapsAndReduces">http://wiki.apache.org/hadoop/HowManyMapsAndReduces</a>
 
 </p>
-<a name="N1024D"></a><a name="If+I+set+up+an+alias+in+my+shell+script%2C+will+that+work+after+-mapper%2C+i.e.+say+I+do%3A+alias+c1%3D%27cut+-f1%27.+Will+-mapper+%22c1%22+work%3F"></a>
+<a name="N1024E"></a><a name="If+I+set+up+an+alias+in+my+shell+script%2C+will+that+work+after+-mapper%2C+i.e.+say+I+do%3A+alias+c1%3D%27cut+-f1%27.+Will+-mapper+%22c1%22+work%3F"></a>
 <h3 class="h4">If I set up an alias in my shell script, will that work after -mapper, i.e. say I do: alias c1='cut -f1'. Will -mapper "c1" work? </h3>
 <p>
 Using an alias will not work, but variable substitution is allowed as shown in this example:
@@ -769,12 +769,12 @@ $ hadoop dfs -cat samples/student_out/part-00000
 75
 80
 </pre>
-<a name="N1025B"></a><a name="Can+I+use+UNIX+pipes%3F+For+example%2C+will+-mapper+%22cut+-f1+%7C+sed+s%2Ffoo%2Fbar%2Fg%22+work%3F"></a>
+<a name="N1025C"></a><a name="Can+I+use+UNIX+pipes%3F+For+example%2C+will+-mapper+%22cut+-f1+%7C+sed+s%2Ffoo%2Fbar%2Fg%22+work%3F"></a>
 <h3 class="h4">Can I use UNIX pipes? For example, will -mapper "cut -f1 | sed s/foo/bar/g" work?</h3>
 <p>
 Currently this does not work and gives a "java.io.IOException: Broken pipe" error. This is probably a bug that needs to be investigated.
 </p>
-<a name="N10265"></a><a name="When+I+run+a+streaming+job+by"></a>
+<a name="N10266"></a><a name="When+I+run+a+streaming+job+by"></a>
 <h3 class="h4">When I run a streaming job by distributing large executables (for example, 3.6G) through the -file option, I get a "No space left on device" error. What do I do? </h3>
 <p>
 The jar packaging happens in a directory pointed to by the configuration variable stream.tmpdir. The default value of stream.tmpdir is /tmp. Set the value to a directory with more space:
@@ -782,7 +782,7 @@ The jar packaging happens in a directory pointed to by the configuration variabl
 <pre class="code">
 -jobconf stream.tmpdir=/export/bigspace/...
 </pre>
-<a name="N10276"></a><a name="How+do+I+specify+multiple+input+directories%3F"></a>
+<a name="N10277"></a><a name="How+do+I+specify+multiple+input+directories%3F"></a>
 <h3 class="h4">How do I specify multiple input directories? </h3>
 <p>
 You can specify multiple input directories with multiple '-input' options:
@@ -790,17 +790,17 @@ You can specify multiple input directories with multiple '-input' options:
 <pre class="code">
  hadoop jar hadoop-streaming.jar -input '/user/foo/dir1' -input '/user/foo/dir2' 
 </pre>
-<a name="N10283"></a><a name="How+do+I+generate+output+files+with+gzip+format%3F"></a>
+<a name="N10284"></a><a name="How+do+I+generate+output+files+with+gzip+format%3F"></a>
 <h3 class="h4">How do I generate output files with gzip format? </h3>
 <p>
 Instead of plain text files, you can generate gzip files as your output. Pass '-jobconf mapred.output.compress=true -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec' as options to your streaming job.
 </p>
-<a name="N1028D"></a><a name="How+do+I+provide+my+own+input%2Foutput+format+with+streaming%3F"></a>
+<a name="N1028E"></a><a name="How+do+I+provide+my+own+input%2Foutput+format+with+streaming%3F"></a>
 <h3 class="h4">How do I provide my own input/output format with streaming? </h3>
 <p>
 At least as late as version 0.14, Hadoop does not support multiple jar files. So, when specifying your own custom classes you will have to pack them along with the streaming jar and use the custom jar instead of the default hadoop streaming jar. 
 </p>
-<a name="N10297"></a><a name="How+do+I+parse+XML+documents+using+streaming%3F"></a>
+<a name="N10298"></a><a name="How+do+I+parse+XML+documents+using+streaming%3F"></a>
 <h3 class="h4">How do I parse XML documents using streaming? </h3>
 <p>
 You can use the record reader StreamXmlRecordReader to process XML documents. 

+ 2 - 2
src/docs/src/documentation/content/xdocs/tabs.xml

@@ -18,8 +18,8 @@
 <!DOCTYPE tabs PUBLIC "-//APACHE//DTD Cocoon Documentation Tab V1.0//EN" 
           "http://forrest.apache.org/dtd/tab-cocoon-v10.dtd">
 
-<tabs software="Nutch"
-      title="Nutch"
+<tabs software="Hadoop"
+      title="Hadoop"
       copyright="The Apache Software Foundation"
       xmlns:xlink="http://www.w3.org/1999/xlink">