@@ -1,3 +1,4 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Hadoop</title>
@@ -6,47 +7,110 @@
Hadoop is a distributed computing platform.
-<p>Hadoop primarily consists of a distributed filesystem (DFS, in <a
-href="org/apache/hadoop/dfs/package-summary.html">org.apache.hadoop.dfs</a>)
-and an implementation of a MapReduce distributed data processor (in <a
-href="org/apache/hadoop/mapred/package-summary.html">org.apache.hadoop.mapred
-</a>).</p>
+<p>Hadoop primarily consists of the <a
+href="org/apache/hadoop/dfs/package-summary.html">Hadoop Distributed FileSystem
+(HDFS)</a> and an
+implementation of the <a href="org/apache/hadoop/mapred/package-summary.html">
+Map-Reduce</a> programming paradigm.</p>
+
+
+<p>Hadoop is a software framework that lets one easily write and run applications
+that process vast amounts of data. Here's what makes Hadoop especially useful:</p>
+<ul>
+ <li>
+ <b>Scalable</b>: Hadoop can reliably store and process petabytes.
+ </li>
+ <li>
+ <b>Economical</b>: It distributes the data and processing across clusters
+ of commonly available computers. These clusters can number into the thousands
+ of nodes.
+ </li>
+ <li>
+    <b>Efficient</b>: By distributing the data, Hadoop can process it in
+    parallel on the nodes where the data is located, which makes processing
+    extremely fast.
+ </li>
+ <li>
+ <b>Reliable</b>: Hadoop automatically maintains multiple copies of data and
+ automatically redeploys computing tasks based on failures.
+ </li>
+</ul>
<h2>Requirements</h2>
-<ol>
-
-<li>Java 1.5.x, preferably from <a
- href="http://java.sun.com/j2se/downloads.html">Sun</a> Set
- <tt>JAVA_HOME</tt> to the root of your Java installation.</li>
-
-<li>ssh must be installed and sshd must be running to use Hadoop's
-scripts to manage remote Hadoop daemons. On Ubuntu, this may done
-with <br><tt>sudo apt-get install ssh</tt></li>
-
-<li>rsync must be installed to use Hadoop's scripts to manage remote
-Hadoop installations. On Ubuntu, this may done with <br><tt>sudo
-apt-get install rsync</tt>.</li>
-
-<li>On Win32, <a href="http://www.cygwin.com/">cygwin</a>, for shell
-support. To use Subversion on Win32, select the subversion package
-when you install, in the "Devel" category. Distributed operation has
-not been well tested on Win32, so this should primarily be considered
-a development platform at this point, not a production platform.</li>
+<h3>Platforms</h3>
+
+<ul>
+ <li>
+    Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes.
+ </li>
+ <li>
+ Win32 is supported as a <i>development</i> platform. Distributed operation
+ has not been well tested on Win32, so this is not a <i>production</i>
+ platform.
+ </li>
+</ul>
+<h3>Requisite Software</h3>
+
+<ol>
+ <li>
+ Java 1.5.x, preferably from
+ <a href="http://java.sun.com/j2se/downloads.html">Sun</a>.
+ Set <tt>JAVA_HOME</tt> to the root of your Java installation.
+ </li>
+ <li>
+    ssh must be installed and sshd must be running to use Hadoop's
+    scripts to manage remote Hadoop daemons (see the example after
+    this list).
+ </li>
+ <li>
+ rsync may be installed to use Hadoop's scripts to manage remote
+ Hadoop installations.
+ </li>
</ol>
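+
+<p>For example, one can verify that sshd is running and reachable by
+logging in to the local host (an illustrative check; the exact output
+depends on your system):</p>
+<blockquote><pre>
+$ ssh localhost
+</pre></blockquote>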
+<h4>Additional requirements for Windows</h4>
+
+<ol>
+ <li>
+ <a href="http://www.cygwin.com/">Cygwin</a> - Required for shell support in
+ addition to the required software above.
+ </li>
+ <li>
+    Subversion - Optional, for checking out code from the source repository.
+ </li>
+</ol>
+
+<h3>Installing Required Software</h3>
+
+<p>If your platform does not have the required software listed above, you
+will have to install it.</p>
+
+<p>For example on Ubuntu Linux:</p>
+<blockquote><pre>
+$ sudo apt-get install ssh
+$ sudo apt-get install rsync
+</pre></blockquote>
+
+<p>On Windows, if you did not install the required software when you
+installed Cygwin, start the Cygwin installer and select the packages:</p>
+<ul>
+ <li>openssh - the "Net" category</li>
+ <li>rsync - the "Net" category</li>
+ <li>subversion (optional) - the "Devel" category</li>
+</ul>
+
<h2>Getting Started</h2>
<p>First, you need to get a copy of the Hadoop code.</p>
<p>You can download a nightly build from <a
-href="http://cvs.apache.org/dist/lucene/hadoop/nightly/">http://cvs.apache.org/dist/lucene/hadoop/nightly/</a>.
-Unpack the release and connect to its top-level directory.</p>
+href="http://cvs.apache.org/dist/lucene/hadoop/nightly/">
+http://cvs.apache.org/dist/lucene/hadoop/nightly/</a>. Unpack the release and
+connect to its top-level directory.</p>
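+
+<p>For example, assuming the nightly tarball is named
+<tt>hadoop-nightly.tar.gz</tt> (a placeholder; use the name of the build you
+actually downloaded):</p>
+<blockquote><pre>
+$ tar xzf hadoop-nightly.tar.gz
+$ cd hadoop-nightly
+</pre></blockquote>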
<p>Or, check out the code from <a
href="http://lucene.apache.org/hadoop/version_control.html">subversion</a>
-and build it with <a href="http://ant.apache.org/">Ant</a>.</p>
+and build it with <a href="http://ant.apache.org/">ant</a>.</p>
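+
+<p>For example (the repository URL below is illustrative; consult the
+version control page linked above for the current location):</p>
+<blockquote><pre>
+$ svn checkout http://svn.apache.org/repos/asf/lucene/hadoop/trunk hadoop
+$ cd hadoop
+$ ant
+</pre></blockquote>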
<p>Edit the file <tt>conf/hadoop-env.sh</tt> to define at least
<tt>JAVA_HOME</tt>.</p>
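+
+<p>For example, assuming the JDK is installed under
+<tt>/usr/lib/jvm/java-1.5.0-sun</tt> (an illustrative path; substitute the
+root of your own Java installation), the relevant line in
+<tt>conf/hadoop-env.sh</tt> would be:</p>
+<blockquote><pre>
+export JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun
+</pre></blockquote>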