
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Hadoop</title>
</head>
<body>
Hadoop is a distributed computing platform.
<p>Hadoop primarily consists of the <a
href="org/apache/hadoop/dfs/package-summary.html">Hadoop Distributed FileSystem
(HDFS)</a> and an
implementation of the <a href="org/apache/hadoop/mapred/package-summary.html">
Map-Reduce</a> programming paradigm.</p>
<p>Hadoop is a software framework that lets one easily write and run applications
that process vast amounts of data. Here's what makes Hadoop especially useful:</p>
<ul>
<li>
<b>Scalable</b>: Hadoop can reliably store and process petabytes.
</li>
<li>
<b>Economical</b>: It distributes the data and processing across clusters
of commonly available computers. These clusters can number into the thousands
of nodes.
</li>
<li>
<b>Efficient</b>: By distributing the data, Hadoop can process it in parallel
on the nodes where the data is located. This makes it extremely rapid.
</li>
<li>
<b>Reliable</b>: Hadoop automatically maintains multiple copies of data and
automatically redeploys computing tasks based on failures.
</li>
</ul>
<h2>Requirements</h2>
<h3>Platforms</h3>
<ul>
<li>
Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes.
</li>
<li>
Win32 is supported as a <i>development</i> platform. Distributed operation
has not been well tested on Win32, so this is not a <i>production</i>
platform.
</li>
</ul>
<h3>Requisite Software</h3>
<ol>
<li>
Java 1.5.x, preferably from
<a href="http://java.sun.com/j2se/downloads.html">Sun</a>.
Set <tt>JAVA_HOME</tt> to the root of your Java installation.
</li>
<li>
ssh must be installed and sshd must be running to use Hadoop's
scripts to manage remote Hadoop daemons.
</li>
<li>
rsync may be installed to use Hadoop's scripts to manage remote
Hadoop installations.
</li>
</ol>
<h4>Additional requirements for Windows</h4>
<ol>
<li>
<a href="http://www.cygwin.com/">Cygwin</a> - Required for shell support in
addition to the required software above.
</li>
</ol>
<h3>Installing Required Software</h3>
<p>If your platform does not have the required software listed above, you
will have to install it.</p>
<p>For example, on Ubuntu Linux:</p>
<blockquote><pre>
$ sudo apt-get install ssh
$ sudo apt-get install rsync
</pre></blockquote>
<p>On Windows, if you did not install the required software when you
installed Cygwin, start the Cygwin installer and select the packages:</p>
<ul>
<li>openssh - the "Net" category</li>
<li>rsync - the "Net" category</li>
</ul>
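<p>Before continuing, you may want to confirm that the requisite software is
actually available on your path. A quick check, assuming a Unix-style (or
Cygwin) shell:</p>
<blockquote><pre>
$ java -version
$ ssh -V
$ rsync --version
</pre></blockquote>
<p>Each command should print a version string rather than "command not
found".</p>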
<h2>Getting Started</h2>
<p>First, you need to get a copy of the Hadoop code.</p>
<p>Edit the file <tt>conf/hadoop-env.sh</tt> to define at least
<tt>JAVA_HOME</tt>.</p>
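<p>For example, if a Sun JDK were installed under
<tt>/usr/lib/jvm/java-1.5.0-sun</tt> (an illustrative path; use the root of
your own installation), the line in <tt>conf/hadoop-env.sh</tt> might read:</p>
<blockquote><pre>
# The java implementation to use (the path below is an example only).
export JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun
</pre></blockquote>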
<p>Try the following command:</p>
<tt>bin/hadoop</tt>
<p>This will display the documentation for the Hadoop command script.</p>
<h2>Standalone operation</h2>
<p>By default, Hadoop is configured to run things in a non-distributed
mode, as a single Java process. This is useful for debugging, and can
be demonstrated as follows:</p>
<tt>
mkdir input<br>
cp conf/*.xml input<br>
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'<br>
cat output/*
</tt>
<p>This will display counts for each match of the <a
href="http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html">
regular expression</a>.</p>
<p>Note that input is specified as a <em>directory</em> containing input
files and that output is also specified as a directory where parts are
written.</p>
<h2>Distributed operation</h2>
To configure Hadoop for distributed operation you must specify the
following:
<ol>
<li>The {@link org.apache.hadoop.hdfs.server.namenode.NameNode} (Distributed Filesystem
master) host. This is specified with the configuration
property <tt><a
href="../hadoop-default.html#fs.default.name">fs.default.name</a></tt>.
</li>
<li>The {@link org.apache.hadoop.mapred.JobTracker} (MapReduce master)
host and port. This is specified with the configuration property
<tt><a
href="../hadoop-default.html#mapred.job.tracker">mapred.job.tracker</a></tt>.
</li>
<li>A <em>slaves</em> file that lists the names of all the hosts in
the cluster. The default slaves file is <tt>conf/slaves</tt>.
</li>
</ol>
<h3>Pseudo-distributed configuration</h3>
You can in fact run everything on a single host. To run things this
way, put the following in <tt>conf/hadoop-site.xml</tt>:
<xmp><configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost/</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration></xmp>
<p>(We also set the HDFS replication level to 1 in order to
reduce warnings when running on a single node.)</p>
<p>Now check that the command <br><tt>ssh localhost</tt><br> does not
require a password. If it does, execute the following commands:</p>
<p><tt>ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa<br>
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
</tt></p>
<h3>Bootstrapping</h3>
<p>A new distributed filesystem must be formatted with the following
command, run on the master node:</p>
<p><tt>bin/hadoop namenode -format</tt></p>
<p>The Hadoop daemons are started with the following command:</p>
<p><tt>bin/start-all.sh</tt></p>
<p>Daemon log output is written to the <tt>logs/</tt> directory.</p>
<p>Input files are copied into the distributed filesystem as follows:</p>
<p><tt>bin/hadoop fs -put input input</tt></p>
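<p>To confirm the copy, you can list the files now stored in the distributed
filesystem:</p>
<p><tt>bin/hadoop fs -ls input</tt></p>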
<h3>Distributed execution</h3>
<p>Things are run as before, but output must be copied locally to
examine it:</p>
<tt>
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'<br>
bin/hadoop fs -get output output<br>
cat output/*
</tt>
<p>When you're done, stop the daemons with:</p>
<p><tt>bin/stop-all.sh</tt></p>
<h3>Fully-distributed operation</h3>
<p>Fully distributed operation is just like the pseudo-distributed operation
described above, except that in <tt>conf/hadoop-site.xml</tt> you specify the
following (a sample configuration is sketched after this list):</p>
<ol>
<li>The hostname or IP address of your master server in the value
for <tt><a
href="../hadoop-default.html#fs.default.name">fs.default.name</a></tt>,
as <tt><em>hdfs://master.example.com/</em></tt>.</li>
<li>The host and port of your master server in the value
of <tt><a href="../hadoop-default.html#mapred.job.tracker">mapred.job.tracker</a></tt>,
as <tt><em>master.example.com</em>:<em>port</em></tt>.</li>
<li>Directories for <tt><a
href="../hadoop-default.html#dfs.name.dir">dfs.name.dir</a></tt> and
<tt><a href="../hadoop-default.html#dfs.data.dir">dfs.data.dir</a></tt>.
These are local directories used to hold distributed filesystem
data on the master node and slave nodes respectively. Note
that <tt>dfs.data.dir</tt> may contain a space- or comma-separated
list of directory names, so that data may be stored on multiple local
devices.</li>
<li><tt><a href="../hadoop-default.html#mapred.local.dir">mapred.local.dir</a></tt>,
the local directory where temporary MapReduce data is stored. It
also may be a list of directories.</li>
<li><tt><a
href="../hadoop-default.html#mapred.map.tasks">mapred.map.tasks</a></tt>
and <tt><a
href="../hadoop-default.html#mapred.reduce.tasks">mapred.reduce.tasks</a></tt>.
As a rule of thumb, use 10x the
number of slave processors for <tt>mapred.map.tasks</tt>, and 2x the
number of slave processors for <tt>mapred.reduce.tasks</tt>.</li>
</ol>
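<p>For example, a <tt>conf/hadoop-site.xml</tt> for a hypothetical cluster of
ten dual-processor slaves (20 slave processors, so 10x gives 200 map tasks and
2x gives 40 reduce tasks) might look like the following. The hostname
<tt>master.example.com</tt>, the port, and the directory paths are
placeholders; substitute values for your own cluster:</p>
<xmp><configuration>
<!-- all values below are illustrative placeholders -->
<property>
<name>fs.default.name</name>
<value>hdfs://master.example.com/</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>master.example.com:9001</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/hadoop/dfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hadoop/dfs/data</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/home/hadoop/mapred/local</value>
</property>
<property>
<name>mapred.map.tasks</name>
<value>200</value>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>40</value>
</property>
</configuration></xmp>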
<p>Finally, list all slave hostnames or IP addresses in your
<tt>conf/slaves</tt> file, one per line. Then format your filesystem
and start your cluster on your master node, as above.</p>
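<p>For instance, a <tt>conf/slaves</tt> file for a small hypothetical cluster
might contain (hostnames are placeholders):</p>
<blockquote><pre>
slave1.example.com
slave2.example.com
slave3.example.com
</pre></blockquote>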
</body>
</html>