12 years ago · e9f01ac284
--- a/hadoop-common-project/hadoop-common/CHANGES.txt
+++ b/hadoop-common-project/hadoop-common/CHANGES.txt
@@ -132,6 +132,8 @@ Trunk (Unreleased)
 
															     HADOOP-9004. Allow security unit tests to use external KDC. (Stephen Chu
														
 
															     via suresh)
														
 
															+    HADOOP-6616. Improve documentation for rack awareness. (Adam Faris via jghoman)
														
 
															+
														
 
															   BUG FIXES
														
 
															     HADOOP-8177. MBeans shouldn't try to register when it fails to create MBeanName.
														
--- a/hadoop-common-project/hadoop-common/src/main/docs/src/documentation/content/xdocs/cluster_setup.xml
+++ b/hadoop-common-project/hadoop-common/src/main/docs/src/documentation/content/xdocs/cluster_setup.xml
@@ -1292,23 +1292,139 @@
 
															     <section>
														
 
															       <title>Hadoop Rack Awareness</title>
														
 
															-      <p>The HDFS and the Map/Reduce components are rack-aware.</p>
														
 
															-      <p>The <code>NameNode</code> and the <code>JobTracker</code> obtains the
														
 
															-      <code>rack id</code> of the slaves in the cluster by invoking an API 
														
 
															-      <a href="ext:api/org/apache/hadoop/net/dnstoswitchmapping/resolve
														
 
															-      ">resolve</a> in an administrator configured
														
 
															-      module. The API resolves the slave's DNS name (also IP address) to a 
														
 
															-      rack id. What module to use can be configured using the configuration
														
 
															-      item <code>net.topology.node.switch.mapping.impl</code>. The default 
														
 
															-      implementation of the same runs a script/command configured using 
														
 
															-      <code>net.topology.script.file.name</code>. If topology.script.file.name is
														
 
															-      not set, the rack id <code>/default-rack</code> is returned for any 
														
 
															-      passed IP address. The additional configuration in the Map/Reduce
														
 
															-      part is <code>mapred.cache.task.levels</code> which determines the number
														
 
															-      of levels (in the network topology) of caches. So, for example, if it is
														
 
															-      the default value of 2, two levels of caches will be constructed - 
														
 
															-      one for hosts (host -> task mapping) and another for racks 
														
 
															-      (rack -> task mapping).
														
 
															+      <p>
														
 
															+         Both HDFS and Map/Reduce components are rack-aware.  HDFS block placement will use rack 
														
 
															+         awareness for fault tolerance by placing one block replica on a different rack.  This provides 
														
 
															+         data availability in the event of a network switch failure within the cluster.  The jobtracker uses rack
														
 
															+         awareness to reduce network transfers of HDFS data blocks by attempting to schedule tasks on datanodes with a local
														
 
															+         copy of needed HDFS blocks.  If the tasks cannot be scheduled on the datanodes
														
 
															+         containing the needed HDFS blocks, then the tasks will be scheduled on the same rack to reduce network transfers if possible.
														
 
															+      </p>
														
 
															+      <p>The NameNode and the JobTracker obtain the rack id of the cluster slaves by invoking either 
														
 
															+         an external script or java class as specified by configuration files.  Using either the 
														
 
															+         java class or external script for topology, output must adhere to the java 
														
 
															+         <a href="ext:api/org/apache/hadoop/net/dnstoswitchmapping/resolve">DNSToSwitchMapping</a> 
														
 
															+         interface.  The interface expects a one-to-one correspondence to be maintained 
														
 
															+         and the topology information in the format of '/myrack/myhost', where '/' is the topology 
														
 
															+         delimiter, 'myrack' is the rack identifier, and 'myhost' is the individual host.  Assuming 
														
 
															+         a single /24 subnet per rack, one could use the format of '/192.168.100.0/192.168.100.5' as a 
														
 
															+         unique rack-host topology mapping.
														
 
															+      </p>
														
 
															+      <p>
														
 
															+         To use the java class for topology mapping, the class name is specified by the 
														
 
															+         <code>'topology.node.switch.mapping.impl'</code> parameter in the configuration file. 
														
 
															+         An example, NetworkTopology.java, is included with the hadoop distribution and can be customized 
														
 
															+         by the hadoop administrator.  If not included with your distribution, NetworkTopology.java can also be found in the Hadoop 
														
 
															+         <a href="http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/NetworkTopology.java?view=markup">
														
 
															+         subversion tree</a>.  Using a java class instead of an external script has a slight performance benefit in 
														
 
															+         that it doesn't need to fork an external process when a new slave node registers itself with the jobtracker or namenode.  
														
 
															+         As this class is only used during slave node registration, the performance benefit is limited.  
														
 
															+      </p>
														
 
															+      <p>
														
 
															+         If implementing an external script, it will be specified with the
														
 
															+         <code>topology.script.file.name</code> parameter in the configuration files.  Unlike the java 
														
 
															+         class, the external topology script is not included with the Hadoop distribution and is provided by the 
														
 
															+         administrator.  Hadoop will send multiple IP addresses to ARGV when forking the topology script.  The  
														
 
															+         number of IP addresses sent to the topology script is controlled with <code>net.topology.script.number.args</code>
														
 
															+         and defaults to 100. If <code>net.topology.script.number.args</code> was changed to 1, a topology script would 
														
 
															+         get forked for each IP submitted by datanodes and/or tasktrackers.  Below are example topology scripts.
														
 
															+      </p>
														
 
															+      <section>
														
 
															+      <title>Python example</title>
														
 
															+      <source>
														
 
															+      <code>
														
 
															+      #!/usr/bin/python
														
 
															+
														
 
															+      # this script makes assumptions about the physical environment.
														
 
															+      #  1) each rack is its own layer 3 network with a /24 subnet, which could be typical where each rack has its own
														
 
															+      #     switch with uplinks to a central core router.
														
 
															+      #     
														
 
															+      #             +-----------+
														
 
															+      #             |core router|
														
 
															+      #             +-----------+
														
 
															+      #            /             \
														
 
															+      #   +-----------+        +-----------+
														
 
															+      #   |rack switch|        |rack switch|
														
 
															+      #   +-----------+        +-----------+
														
 
															+      #   | data node |        | data node |
														
 
															+      #   +-----------+        +-----------+
														
 
															+      #   | data node |        | data node |
														
 
															+      #   +-----------+        +-----------+
														
 
															+      #
														
 
															+      # 2) topology script gets list of IP's as input, calculates network address, and prints '/network_address/ip'.
														
 
															+
														
 
															+      import netaddr
														
 
															+      import sys             
														
 
															+      sys.argv.pop(0)                                                  # discard name of topology script from argv list as we just want IP addresses
														
 
															+
														
 
															+      netmask = '255.255.255.0'                                        # set netmask to what's being used in your environment.  The example uses a /24
														
 
															+
														
 
															+      for ip in sys.argv:                                              # loop over list of datanode IP's
														
 
															+          address = '{0}/{1}'.format(ip, netmask)                      # format address string so it looks like 'ip/netmask' to make netaddr work
														
 
															+          try:
														
 
															+              network_address = netaddr.IPNetwork(address).network     # calculate and print network address
														
 
															+              print "/{0}".format(network_address)                     
														
 
															+          except:
														
 
															+              print "/rack-unknown"                                    # print catch-all value if unable to calculate network address
														
 
															+
														
 
															+      </code>
														
 
															+      </source>
														
 
															+      </section>
														
 
															+          
														
 
															+      <section>
														
 
															+      <title>Bash  example</title>
														
 
															+      <source>
														
 
															+      <code>
														
 
															+      #!/bin/bash
														
 
															+      # Here's a bash example to show just how simple these scripts can be
														
 
															+      
														
 
															+      # Assuming we have flat network with everything on a single switch, we can fake a rack topology. 
														
 
															+      # This could occur in a lab environment where we have limited nodes,like 2-8 physical machines on a unmanaged switch. 
														
 
															+      # This may also apply to multiple virtual machines running on the same physical hardware.  
														
 
															+      # The number of machines isn't important, but that we are trying to fake a network topology when there isn't one. 
														
 
															+      #
														
 
															+      #       +----------+    +--------+
														
 
															+      #       |jobtracker|    |datanode| 
														
 
															+      #       +----------+    +--------+
														
 
															+      #              \        /
														
 
															+      #  +--------+  +--------+  +--------+
														
 
															+      #  |datanode|--| switch |--|datanode|
														
 
															+      #  +--------+  +--------+  +--------+
														
 
															+      #              /        \
														
 
															+      #       +--------+    +--------+
														
 
															+      #       |datanode|    |namenode| 
														
 
															+      #       +--------+    +--------+
														
 
															+      #
														
 
															+      # With this network topology, we are treating each host as a rack.  This is being done by taking the last octet 
														
 
															+      # in the datanode's IP and prepending it with the word '/rack-'.  The advantage for doing this is so HDFS
														
 
															+      # can create its 'off-rack' block copy.
														
 
															+      
														
 
															+      # 1) 'echo $@' will echo all ARGV values to xargs.  
														
 
															+      # 2) 'xargs' will enforce that we print a single argv value per line
														
 
															+      # 3) 'awk' will split fields on dots and append the last field to the string '/rack-'. If awk 
														
 
															+      #    fails to split on four dots, it will still print '/rack-' last field value
														
 
															+
														
 
															+      echo $@ | xargs -n 1 | awk -F '.' '{print "/rack-"$NF}'
														
 
															+
														
 
															+
														
 
															+      </code>
														
 
															+      </source>
														
 
															+      </section>
														
 
															+
														
 
															+
														
 
															+      <p>
														
 
															+         If <code>topology.script.file.name</code> or <code>topology.node.switch.mapping.impl</code> is 
														
 
															+         not set, the rack id '/default-rack' is returned for any passed IP address.  
														
 
															+         While this behavior appears desirable, it can cause issues with HDFS block replication as 
														
 
															+         default behavior is to write one replicated block off rack and is unable to do so as there is 
														
 
															+         only a single rack named '/default-rack'.
														
 
															+      </p>
														
 
															+      <p>
														
 
															+         An additional configuration setting is <code>mapred.cache.task.levels</code> which determines 
														
 
															+         the number of levels (in the network topology) of caches. So, for example, if it is the 
														
 
															+         default value of 2, two levels of caches will be constructed - one for hosts 
														
 
															+         (host -> task mapping) and another for racks (rack -> task mapping). Giving us our one-to-one 
														
 
															+          mapping of '/myrack/myhost'
														
 
															       </p>
														
 
															     </section>