|
@@ -247,6 +247,14 @@ document.write("Last Published: " + document.lastModified);
|
|
|
</li>
|
|
|
<li>
|
|
|
<a href="#Task+Execution+%26+Environment">Task Execution & Environment</a>
|
|
|
+<ul class="minitoc">
|
|
|
+<li>
|
|
|
+<a href="#Map+Parameters">Map Parameters</a>
|
|
|
+</li>
|
|
|
+<li>
|
|
|
+<a href="#Shuffle%2FReduce+Parameters">Shuffle/Reduce Parameters</a>
|
|
|
+</li>
|
|
|
+</ul>
|
|
|
</li>
|
|
|
<li>
|
|
|
<a href="#Job+Submission+and+Monitoring">Job Submission and Monitoring</a>
|
|
@@ -316,7 +324,7 @@ document.write("Last Published: " + document.lastModified);
|
|
|
<a href="#Example%3A+WordCount+v2.0">Example: WordCount v2.0</a>
|
|
|
<ul class="minitoc">
|
|
|
<li>
|
|
|
-<a href="#Source+Code-N10E46">Source Code</a>
|
|
|
+<a href="#Source+Code-N10F30">Source Code</a>
|
|
|
</li>
|
|
|
<li>
|
|
|
<a href="#Sample+Runs">Sample Runs</a>
|
|
@@ -1608,6 +1616,183 @@ document.write("Last Published: " + document.lastModified);
|
|
|
greater than any value specified for a maximum heap-size
|
|
|
of the child jvm via <span class="codefrag">mapred.child.java.opts</span>, or a ulimit
|
|
|
value in <span class="codefrag">mapred.child.ulimit</span>. </p>
|
|
|
+<p>The memory available to some parts of the framework is also
|
|
|
+   configurable. In map and reduce tasks, performance may be improved
|
|
|
+   by adjusting parameters that affect the concurrency of operations
|
|
|
+   and the frequency with which data hits disk. Monitoring a job's
|
|
|
+   filesystem counters, particularly the byte counts out of the map
|
|
|
+   and into the reduce, is invaluable when tuning these
|
|
|
+   parameters.</p>
|
|
|
+<a name="N108E9"></a><a name="Map+Parameters"></a>
|
|
|
+<h4>Map Parameters</h4>
|
|
|
+<p>A record emitted from a map will be serialized into a buffer and
|
|
|
+ metadata will be stored into accounting buffers. As described in the
|
|
|
+ following options, when either the serialization buffer or the
|
|
|
+ metadata exceed a threshold, the contents of the buffers will be
|
|
|
+ sorted and written to disk in the background while the map continues
|
|
|
+ to output records. If either buffer fills completely while the spill
|
|
|
+ is in progress, the map thread will block. When the map is finished,
|
|
|
+ any remaining records are written to disk and all on-disk segments
|
|
|
+ are merged into a single file. Minimizing the number of spills to
|
|
|
+ disk can decrease map time, but a larger buffer also decreases the
|
|
|
+ memory available to the mapper.</p>
|
|
|
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
|
|
|
+
|
|
|
+<tr>
|
|
|
+<th colspan="1" rowspan="1">Name</th><th colspan="1" rowspan="1">Type</th><th colspan="1" rowspan="1">Description</th>
|
|
|
+</tr>
|
|
|
+
|
|
|
+<tr>
|
|
|
+<td colspan="1" rowspan="1">io.sort.mb</td><td colspan="1" rowspan="1">int</td>
|
|
|
+ <td colspan="1" rowspan="1">The cumulative size of the serialization and accounting
|
|
|
+ buffers storing records emitted from the map, in megabytes.
|
|
|
+ </td>
|
|
|
+</tr>
|
|
|
+
|
|
|
+<tr>
|
|
|
+<td colspan="1" rowspan="1">io.sort.record.percent</td><td colspan="1" rowspan="1">float</td>
|
|
|
+ <td colspan="1" rowspan="1">The ratio of serialization to accounting space can be
|
|
|
+ adjusted. Each serialized record requires 16 bytes of
|
|
|
+ accounting information in addition to its serialized size to
|
|
|
+ effect the sort. This percentage of space allocated from
|
|
|
+ <span class="codefrag">io.sort.mb</span> affects the probability of a spill to
|
|
|
+ disk being caused by either exhaustion of the serialization
|
|
|
+ buffer or the accounting space. Clearly, for a map outputting
|
|
|
+ small records, a higher value than the default will likely
|
|
|
+ decrease the number of spills to disk.</td>
|
|
|
+</tr>
|
|
|
+
|
|
|
+<tr>
|
|
|
+<td colspan="1" rowspan="1">io.sort.spill.percent</td><td colspan="1" rowspan="1">float</td>
|
|
|
+ <td colspan="1" rowspan="1">This is the threshold for the accounting and serialization
|
|
|
+ buffers. When this percentage of either buffer has filled,
|
|
|
+ their contents will be spilled to disk in the background. Let
|
|
|
+ <span class="codefrag">io.sort.record.percent</span> be <em>r</em>,
|
|
|
+ <span class="codefrag">io.sort.mb</span> be <em>x</em>, and this value be
|
|
|
+ <em>q</em>. The maximum number of records collected before the
|
|
|
+ collection thread will spill is <span class="codefrag">r * x * q * 2^16</span>.
|
|
|
+      Note that a higher value may reduce the number of merges, or
|
|
|
+      even eliminate them, but will also increase the probability of
|
|
|
+ the map task getting blocked. The lowest average map times are
|
|
|
+ usually obtained by accurately estimating the size of the map
|
|
|
+ output and preventing multiple spills.</td>
|
|
|
+</tr>
|
|
|
+
|
|
|
+</table>
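The spill-threshold arithmetic above can be restated as a small worked example. The class below is a sketch only: it reproduces the `r * x * q * 2^16` formula with illustrative numbers, not recommended settings, and `maxRecordsBeforeSpill` is a hypothetical helper, not a framework API.

```java
// Illustrative restatement of the spill arithmetic described above.
// With io.sort.mb = x megabytes, io.sort.record.percent = r and
// io.sort.spill.percent = q, each record consumes 16 bytes of the
// r * x MB of accounting space, so the collector starts a spill after
// at most (r * x * 2^20 / 16) * q = r * x * q * 2^16 records.
public class SpillLimit {
    static long maxRecordsBeforeSpill(int ioSortMb, double recordPercent,
                                      double spillPercent) {
        return (long) (recordPercent * ioSortMb * spillPercent * (1 << 16));
    }

    public static void main(String[] args) {
        // Example values: io.sort.mb=100, r=0.05, q=0.80
        System.out.println(maxRecordsBeforeSpill(100, 0.05, 0.80)); // prints 262144
    }
}
```

As a rough corollary of the same arithmetic: with `r` of the buffer reserved for accounting, accounting space fills before the serialization buffer whenever records average under about `16 * (1 - r) / r` serialized bytes (roughly 304 bytes at r = 0.05), which is why raising `r` helps maps emitting many small records.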
|
|
|
+<p>Other notes</p>
|
|
|
+<ul>
|
|
|
+
|
|
|
+<li>If either spill threshold is exceeded while a spill is in
|
|
|
+ progress, collection will continue until the spill is finished.
|
|
|
+      For example, if <span class="codefrag">io.sort.spill.percent</span> is set
|
|
|
+ to 0.33, and the remainder of the buffer is filled while the spill
|
|
|
+ runs, the next spill will include all the collected records, or
|
|
|
+ 0.66 of the buffer, and will not generate additional spills. In
|
|
|
+      other words, the thresholds define triggers, not blocking
|
|
|
+      limits.</li>
|
|
|
+
|
|
|
+<li>A record larger than the serialization buffer will first
|
|
|
+ trigger a spill, then be spilled to a separate file. It is
|
|
|
+ undefined whether or not this record will first pass through the
|
|
|
+ combiner.</li>
|
|
|
+
|
|
|
+</ul>
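As a sketch, the map-side parameters above could be set in a job's configuration file. The property names are those from the table; the values are purely illustrative assumptions for a map emitting many small records, not recommended defaults.

```xml
<!-- Illustrative values only: tune against the job's filesystem counters. -->
<property>
  <name>io.sort.mb</name>
  <value>200</value>
  <description>Larger serialization/accounting buffer for a map with large output.</description>
</property>
<property>
  <name>io.sort.record.percent</name>
  <value>0.15</value>
  <description>Extra accounting space for a map emitting many small records.</description>
</property>
<property>
  <name>io.sort.spill.percent</name>
  <value>0.95</value>
  <description>Spill later when the map output size is estimated accurately.</description>
</property>
```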
|
|
|
+<a name="N10955"></a><a name="Shuffle%2FReduce+Parameters"></a>
|
|
|
+<h4>Shuffle/Reduce Parameters</h4>
|
|
|
+<p>As described previously, each reduce fetches the output assigned
|
|
|
+ to it by the Partitioner via HTTP into memory and periodically
|
|
|
+ merges these outputs to disk. If intermediate compression of map
|
|
|
+ outputs is turned on, each output is decompressed into memory. The
|
|
|
+ following options affect the frequency of these merges to disk prior
|
|
|
+ to the reduce and the memory allocated to map output during the
|
|
|
+ reduce.</p>
|
|
|
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
|
|
|
+
|
|
|
+<tr>
|
|
|
+<th colspan="1" rowspan="1">Name</th><th colspan="1" rowspan="1">Type</th><th colspan="1" rowspan="1">Description</th>
|
|
|
+</tr>
|
|
|
+
|
|
|
+<tr>
|
|
|
+<td colspan="1" rowspan="1">io.sort.factor</td><td colspan="1" rowspan="1">int</td>
|
|
|
+ <td colspan="1" rowspan="1">Specifies the number of segments on disk to be merged at
|
|
|
+ the same time. It limits the number of open files and
|
|
|
+ compression codecs during the merge. If the number of files
|
|
|
+ exceeds this limit, the merge will proceed in several passes.
|
|
|
+ Though this limit also applies to the map, most jobs should be
|
|
|
+ configured so that hitting this limit is unlikely
|
|
|
+ there.</td>
|
|
|
+</tr>
|
|
|
+
|
|
|
+<tr>
|
|
|
+<td colspan="1" rowspan="1">mapred.inmem.merge.threshold</td><td colspan="1" rowspan="1">int</td>
|
|
|
+ <td colspan="1" rowspan="1">The number of sorted map outputs fetched into memory
|
|
|
+ before being merged to disk. Like the spill thresholds in the
|
|
|
+      preceding note, this defines a trigger rather than a unit of
|
|
|
+      partitioning. In practice, this is usually set very high (1000)
|
|
|
+ or disabled (0), since merging in-memory segments is often
|
|
|
+ less expensive than merging from disk (see notes following
|
|
|
+ this table). This threshold influences only the frequency of
|
|
|
+ in-memory merges during the shuffle.</td>
|
|
|
+</tr>
|
|
|
+
|
|
|
+<tr>
|
|
|
+<td colspan="1" rowspan="1">mapred.job.shuffle.merge.percent</td><td colspan="1" rowspan="1">float</td>
|
|
|
+ <td colspan="1" rowspan="1">The memory threshold for fetched map outputs before an
|
|
|
+ in-memory merge is started, expressed as a percentage of
|
|
|
+ memory allocated to storing map outputs in memory. Since map
|
|
|
+ outputs that can't fit in memory can be stalled, setting this
|
|
|
+ high may decrease parallelism between the fetch and merge.
|
|
|
+ Conversely, values as high as 1.0 have been effective for
|
|
|
+ reduces whose input can fit entirely in memory. This parameter
|
|
|
+ influences only the frequency of in-memory merges during the
|
|
|
+ shuffle.</td>
|
|
|
+</tr>
|
|
|
+
|
|
|
+<tr>
|
|
|
+<td colspan="1" rowspan="1">mapred.job.shuffle.input.buffer.percent</td><td colspan="1" rowspan="1">float</td>
|
|
|
+ <td colspan="1" rowspan="1">The percentage of memory- relative to the maximum heapsize
|
|
|
+ as typically specified in <span class="codefrag">mapred.child.java.opts</span>-
|
|
|
+ that can be allocated to storing map outputs during the
|
|
|
+ shuffle. Though some memory should be set aside for the
|
|
|
+ framework, in general it is advantageous to set this high
|
|
|
+ enough to store large and numerous map outputs.</td>
|
|
|
+</tr>
|
|
|
+
|
|
|
+<tr>
|
|
|
+<td colspan="1" rowspan="1">mapred.job.reduce.input.buffer.percent</td><td colspan="1" rowspan="1">float</td>
|
|
|
+ <td colspan="1" rowspan="1">The percentage of memory relative to the maximum heapsize
|
|
|
+ in which map outputs may be retained during the reduce. When
|
|
|
+ the reduce begins, map outputs will be merged to disk until
|
|
|
+ those that remain are under the resource limit this defines.
|
|
|
+ By default, all map outputs are merged to disk before the
|
|
|
+ reduce begins to maximize the memory available to the reduce.
|
|
|
+ For less memory-intensive reduces, this should be increased to
|
|
|
+ avoid trips to disk.</td>
|
|
|
+</tr>
|
|
|
+
|
|
|
+</table>
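The shuffle- and reduce-side parameters above can likewise be sketched as configuration. Again, the property names come from the table while the values are illustrative assumptions, here for a reduce whose input fits comfortably in memory.

```xml
<!-- Illustrative values only, not defaults. -->
<property>
  <name>mapred.inmem.merge.threshold</name>
  <value>0</value>
  <description>Disable the count-based trigger; rely on the memory threshold.</description>
</property>
<property>
  <name>mapred.job.shuffle.input.buffer.percent</name>
  <value>0.70</value>
</property>
<property>
  <name>mapred.job.reduce.input.buffer.percent</name>
  <value>0.80</value>
  <description>Retain map outputs in memory for a lightweight reduce.</description>
</property>
```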
|
|
|
+<p>Other notes</p>
|
|
|
+<ul>
|
|
|
+
|
|
|
+<li>If a map output is larger than 25 percent of the memory
|
|
|
+ allocated to copying map outputs, it will be written directly to
|
|
|
+ disk without first staging through memory.</li>
|
|
|
+
|
|
|
+<li>When running with a combiner, the reasoning about high merge
|
|
|
+ thresholds and large buffers may not hold. For merges started
|
|
|
+ before all map outputs have been fetched, the combiner is run
|
|
|
+ while spilling to disk. In some cases, one can obtain better
|
|
|
+ reduce times by spending resources combining map outputs- making
|
|
|
+ disk spills small and parallelizing spilling and fetching- rather
|
|
|
+ than aggressively increasing buffer sizes.</li>
|
|
|
+
|
|
|
+<li>When merging in-memory map outputs to disk to begin the
|
|
|
+ reduce, if an intermediate merge is necessary because there are
|
|
|
+ segments to spill and at least <span class="codefrag">io.sort.factor</span>
|
|
|
+ segments already on disk, the in-memory map outputs will be part
|
|
|
+ of the intermediate merge.</li>
|
|
|
+
|
|
|
+</ul>
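The first note above (a map output larger than 25 percent of the copy buffer is written directly to disk) can be made concrete. The sketch below uses assumed numbers (a 512 MB child heap); `shuffleBufferBytes` and `fetchedToDisk` are hypothetical helper names, not framework APIs.

```java
// Sketch of the shuffle memory budget. A fraction of the child heap
// (mapred.job.shuffle.input.buffer.percent) holds fetched map outputs;
// per the note above, a single output larger than a quarter of that
// buffer bypasses memory and goes straight to disk.
public class ShuffleBuffer {
    static long shuffleBufferBytes(long heapBytes, double inputBufferPercent) {
        return (long) (heapBytes * inputBufferPercent);
    }

    static boolean fetchedToDisk(long mapOutputBytes, long shuffleBufferBytes) {
        // Outputs above 25 percent of the buffer skip the in-memory stage.
        return mapOutputBytes > shuffleBufferBytes / 4;
    }

    public static void main(String[] args) {
        long heap = 512L << 20;                       // assume -Xmx512m child heap
        long buffer = shuffleBufferBytes(heap, 0.70); // prints 375809638 (~358 MB)
        System.out.println(buffer);
        System.out.println(fetchedToDisk(100L << 20, buffer)); // prints true
    }
}
```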
|
|
|
<p>The task tracker has local directory,
|
|
|
<span class="codefrag"> ${mapred.local.dir}/taskTracker/</span> to create localized
|
|
|
cache and localized job. It can define multiple local directories
|
|
@@ -1786,7 +1971,7 @@ document.write("Last Published: " + document.lastModified);
|
|
|
<a href="native_libraries.html#Loading+native+libraries+through+DistributedCache">
|
|
|
native_libraries.html</a>
|
|
|
</p>
|
|
|
-<a name="N10A23"></a><a name="Job+Submission+and+Monitoring"></a>
|
|
|
+<a name="N10B0D"></a><a name="Job+Submission+and+Monitoring"></a>
|
|
|
<h3 class="h4">Job Submission and Monitoring</h3>
|
|
|
<p>
|
|
|
<a href="api/org/apache/hadoop/mapred/JobClient.html">
|
|
@@ -1847,7 +2032,7 @@ document.write("Last Published: " + document.lastModified);
|
|
|
<p>Normally the user creates the application, describes various facets
|
|
|
of the job via <span class="codefrag">JobConf</span>, and then uses the
|
|
|
<span class="codefrag">JobClient</span> to submit the job and monitor its progress.</p>
|
|
|
-<a name="N10A83"></a><a name="Job+Control"></a>
|
|
|
+<a name="N10B6D"></a><a name="Job+Control"></a>
|
|
|
<h4>Job Control</h4>
|
|
|
<p>Users may need to chain Map/Reduce jobs to accomplish complex
|
|
|
tasks which cannot be done via a single Map/Reduce job. This is fairly
|
|
@@ -1883,7 +2068,7 @@ document.write("Last Published: " + document.lastModified);
|
|
|
</li>
|
|
|
|
|
|
</ul>
|
|
|
-<a name="N10AAD"></a><a name="Job+Input"></a>
|
|
|
+<a name="N10B97"></a><a name="Job+Input"></a>
|
|
|
<h3 class="h4">Job Input</h3>
|
|
|
<p>
|
|
|
<a href="api/org/apache/hadoop/mapred/InputFormat.html">
|
|
@@ -1931,7 +2116,7 @@ document.write("Last Published: " + document.lastModified);
|
|
|
appropriate <span class="codefrag">CompressionCodec</span>. However, it must be noted that
|
|
|
compressed files with the above extensions cannot be <em>split</em> and
|
|
|
each compressed file is processed in its entirety by a single mapper.</p>
|
|
|
-<a name="N10B17"></a><a name="InputSplit"></a>
|
|
|
+<a name="N10C01"></a><a name="InputSplit"></a>
|
|
|
<h4>InputSplit</h4>
|
|
|
<p>
|
|
|
<a href="api/org/apache/hadoop/mapred/InputSplit.html">
|
|
@@ -1945,7 +2130,7 @@ document.write("Last Published: " + document.lastModified);
|
|
|
FileSplit</a> is the default <span class="codefrag">InputSplit</span>. It sets
|
|
|
<span class="codefrag">map.input.file</span> to the path of the input file for the
|
|
|
logical split.</p>
|
|
|
-<a name="N10B3C"></a><a name="RecordReader"></a>
|
|
|
+<a name="N10C26"></a><a name="RecordReader"></a>
|
|
|
<h4>RecordReader</h4>
|
|
|
<p>
|
|
|
<a href="api/org/apache/hadoop/mapred/RecordReader.html">
|
|
@@ -1957,7 +2142,7 @@ document.write("Last Published: " + document.lastModified);
|
|
|
for processing. <span class="codefrag">RecordReader</span> thus assumes the
|
|
|
responsibility of processing record boundaries and presents the tasks
|
|
|
with keys and values.</p>
|
|
|
-<a name="N10B5F"></a><a name="Job+Output"></a>
|
|
|
+<a name="N10C49"></a><a name="Job+Output"></a>
|
|
|
<h3 class="h4">Job Output</h3>
|
|
|
<p>
|
|
|
<a href="api/org/apache/hadoop/mapred/OutputFormat.html">
|
|
@@ -1982,7 +2167,7 @@ document.write("Last Published: " + document.lastModified);
|
|
|
<p>
|
|
|
<span class="codefrag">TextOutputFormat</span> is the default
|
|
|
<span class="codefrag">OutputFormat</span>.</p>
|
|
|
-<a name="N10B88"></a><a name="OutputCommitter"></a>
|
|
|
+<a name="N10C72"></a><a name="OutputCommitter"></a>
|
|
|
<h4>OutputCommitter</h4>
|
|
|
<p>
|
|
|
<a href="api/org/apache/hadoop/mapred/OutputCommitter.html">
|
|
@@ -2026,7 +2211,7 @@ document.write("Last Published: " + document.lastModified);
|
|
|
<p>
|
|
|
<span class="codefrag">FileOutputCommitter</span> is the default
|
|
|
<span class="codefrag">OutputCommitter</span>.</p>
|
|
|
-<a name="N10BB8"></a><a name="Task+Side-Effect+Files"></a>
|
|
|
+<a name="N10CA2"></a><a name="Task+Side-Effect+Files"></a>
|
|
|
<h4>Task Side-Effect Files</h4>
|
|
|
<p>In some applications, component tasks need to create and/or write to
|
|
|
side-files, which differ from the actual job-output files.</p>
|
|
@@ -2067,7 +2252,7 @@ document.write("Last Published: " + document.lastModified);
|
|
|
<p>The entire discussion holds true for maps of jobs with
|
|
|
reducer=NONE (i.e. 0 reduces) since output of the map, in that case,
|
|
|
goes directly to HDFS.</p>
|
|
|
-<a name="N10C06"></a><a name="RecordWriter"></a>
|
|
|
+<a name="N10CF0"></a><a name="RecordWriter"></a>
|
|
|
<h4>RecordWriter</h4>
|
|
|
<p>
|
|
|
<a href="api/org/apache/hadoop/mapred/RecordWriter.html">
|
|
@@ -2075,9 +2260,9 @@ document.write("Last Published: " + document.lastModified);
|
|
|
pairs to an output file.</p>
|
|
|
<p>RecordWriter implementations write the job outputs to the
|
|
|
<span class="codefrag">FileSystem</span>.</p>
|
|
|
-<a name="N10C1D"></a><a name="Other+Useful+Features"></a>
|
|
|
+<a name="N10D07"></a><a name="Other+Useful+Features"></a>
|
|
|
<h3 class="h4">Other Useful Features</h3>
|
|
|
-<a name="N10C23"></a><a name="Counters"></a>
|
|
|
+<a name="N10D0D"></a><a name="Counters"></a>
|
|
|
<h4>Counters</h4>
|
|
|
<p>
|
|
|
<span class="codefrag">Counters</span> represent global counters, defined either by
|
|
@@ -2094,7 +2279,7 @@ document.write("Last Published: " + document.lastModified);
|
|
|
in the <span class="codefrag">map</span> and/or
|
|
|
<span class="codefrag">reduce</span> methods. These counters are then globally
|
|
|
aggregated by the framework.</p>
|
|
|
-<a name="N10C52"></a><a name="DistributedCache"></a>
|
|
|
+<a name="N10D3C"></a><a name="DistributedCache"></a>
|
|
|
<h4>DistributedCache</h4>
|
|
|
<p>
|
|
|
<a href="api/org/apache/hadoop/filecache/DistributedCache.html">
|
|
@@ -2165,7 +2350,7 @@ document.write("Last Published: " + document.lastModified);
|
|
|
<span class="codefrag">mapred.job.classpath.{files|archives}</span>. Similarly the
|
|
|
cached files that are symlinked into the working directory of the
|
|
|
task can be used to distribute native libraries and load them.</p>
|
|
|
-<a name="N10CD5"></a><a name="Tool"></a>
|
|
|
+<a name="N10DBF"></a><a name="Tool"></a>
|
|
|
<h4>Tool</h4>
|
|
|
<p>The <a href="api/org/apache/hadoop/util/Tool.html">Tool</a>
|
|
|
interface supports the handling of generic Hadoop command-line options.
|
|
@@ -2205,7 +2390,7 @@ document.write("Last Published: " + document.lastModified);
|
|
|
</span>
|
|
|
|
|
|
</p>
|
|
|
-<a name="N10D07"></a><a name="IsolationRunner"></a>
|
|
|
+<a name="N10DF1"></a><a name="IsolationRunner"></a>
|
|
|
<h4>IsolationRunner</h4>
|
|
|
<p>
|
|
|
<a href="api/org/apache/hadoop/mapred/IsolationRunner.html">
|
|
@@ -2229,7 +2414,7 @@ document.write("Last Published: " + document.lastModified);
|
|
|
<p>
|
|
|
<span class="codefrag">IsolationRunner</span> will run the failed task in a single
|
|
|
jvm, which can be in the debugger, over precisely the same input.</p>
|
|
|
-<a name="N10D3A"></a><a name="Profiling"></a>
|
|
|
+<a name="N10E24"></a><a name="Profiling"></a>
|
|
|
<h4>Profiling</h4>
|
|
|
<p>Profiling is a utility to get a representative (2 or 3) sample
|
|
|
of built-in java profiler for a sample of maps and reduces. </p>
|
|
@@ -2262,7 +2447,7 @@ document.write("Last Published: " + document.lastModified);
|
|
|
<span class="codefrag">-agentlib:hprof=cpu=samples,heap=sites,force=n,thread=y,verbose=n,file=%s</span>
|
|
|
|
|
|
</p>
|
|
|
-<a name="N10D6E"></a><a name="Debugging"></a>
|
|
|
+<a name="N10E58"></a><a name="Debugging"></a>
|
|
|
<h4>Debugging</h4>
|
|
|
<p>Map/Reduce framework provides a facility to run user-provided
|
|
|
scripts for debugging. When map/reduce task fails, user can run
|
|
@@ -2273,14 +2458,14 @@ document.write("Last Published: " + document.lastModified);
|
|
|
<p> In the following sections we discuss how to submit debug script
|
|
|
along with the job. For submitting debug script, first it has to
|
|
|
distributed. Then the script has to supplied in Configuration. </p>
|
|
|
-<a name="N10D7A"></a><a name="How+to+distribute+script+file%3A"></a>
|
|
|
+<a name="N10E64"></a><a name="How+to+distribute+script+file%3A"></a>
|
|
|
<h5> How to distribute script file: </h5>
|
|
|
<p>
|
|
|
The user has to use
|
|
|
<a href="mapred_tutorial.html#DistributedCache">DistributedCache</a>
|
|
|
mechanism to <em>distribute</em> and <em>symlink</em> the
|
|
|
debug script file.</p>
|
|
|
-<a name="N10D8E"></a><a name="How+to+submit+script%3A"></a>
|
|
|
+<a name="N10E78"></a><a name="How+to+submit+script%3A"></a>
|
|
|
<h5> How to submit script: </h5>
|
|
|
<p> A quick way to submit debug script is to set values for the
|
|
|
properties "mapred.map.task.debug.script" and
|
|
@@ -2304,17 +2489,17 @@ document.write("Last Published: " + document.lastModified);
|
|
|
<span class="codefrag">$script $stdout $stderr $syslog $jobconf $program </span>
|
|
|
|
|
|
</p>
|
|
|
-<a name="N10DB0"></a><a name="Default+Behavior%3A"></a>
|
|
|
+<a name="N10E9A"></a><a name="Default+Behavior%3A"></a>
|
|
|
<h5> Default Behavior: </h5>
|
|
|
<p> For pipes, a default script is run to process core dumps under
|
|
|
gdb, prints stack trace and gives info about running threads. </p>
|
|
|
-<a name="N10DBB"></a><a name="JobControl"></a>
|
|
|
+<a name="N10EA5"></a><a name="JobControl"></a>
|
|
|
<h4>JobControl</h4>
|
|
|
<p>
|
|
|
<a href="api/org/apache/hadoop/mapred/jobcontrol/package-summary.html">
|
|
|
JobControl</a> is a utility which encapsulates a set of Map/Reduce jobs
|
|
|
and their dependencies.</p>
|
|
|
-<a name="N10DC8"></a><a name="Data+Compression"></a>
|
|
|
+<a name="N10EB2"></a><a name="Data+Compression"></a>
|
|
|
<h4>Data Compression</h4>
|
|
|
<p>Hadoop Map/Reduce provides facilities for the application-writer to
|
|
|
specify compression for both intermediate map-outputs and the
|
|
@@ -2328,7 +2513,7 @@ document.write("Last Published: " + document.lastModified);
|
|
|
codecs for reasons of both performance (zlib) and non-availability of
|
|
|
Java libraries (lzo). More details on their usage and availability are
|
|
|
available <a href="native_libraries.html">here</a>.</p>
|
|
|
-<a name="N10DE8"></a><a name="Intermediate+Outputs"></a>
|
|
|
+<a name="N10ED2"></a><a name="Intermediate+Outputs"></a>
|
|
|
<h5>Intermediate Outputs</h5>
|
|
|
<p>Applications can control compression of intermediate map-outputs
|
|
|
via the
|
|
@@ -2337,7 +2522,7 @@ document.write("Last Published: " + document.lastModified);
|
|
|
<span class="codefrag">CompressionCodec</span> to be used via the
|
|
|
<a href="api/org/apache/hadoop/mapred/JobConf.html#setMapOutputCompressorClass(java.lang.Class)">
|
|
|
JobConf.setMapOutputCompressorClass(Class)</a> api.</p>
|
|
|
-<a name="N10DFD"></a><a name="Job+Outputs"></a>
|
|
|
+<a name="N10EE7"></a><a name="Job+Outputs"></a>
|
|
|
<h5>Job Outputs</h5>
|
|
|
<p>Applications can control compression of job-outputs via the
|
|
|
<a href="api/org/apache/hadoop/mapred/FileOutputFormat.html#setCompressOutput(org.apache.hadoop.mapred.JobConf,%20boolean)">
|
|
@@ -2357,7 +2542,7 @@ document.write("Last Published: " + document.lastModified);
|
|
|
</div>
|
|
|
|
|
|
|
|
|
-<a name="N10E2C"></a><a name="Example%3A+WordCount+v2.0"></a>
|
|
|
+<a name="N10F16"></a><a name="Example%3A+WordCount+v2.0"></a>
|
|
|
<h2 class="h3">Example: WordCount v2.0</h2>
|
|
|
<div class="section">
|
|
|
<p>Here is a more complete <span class="codefrag">WordCount</span> which uses many of the
|
|
@@ -2367,7 +2552,7 @@ document.write("Last Published: " + document.lastModified);
|
|
|
<a href="quickstart.html#SingleNodeSetup">pseudo-distributed</a> or
|
|
|
<a href="quickstart.html#Fully-Distributed+Operation">fully-distributed</a>
|
|
|
Hadoop installation.</p>
|
|
|
-<a name="N10E46"></a><a name="Source+Code-N10E46"></a>
|
|
|
+<a name="N10F30"></a><a name="Source+Code-N10F30"></a>
|
|
|
<h3 class="h4">Source Code</h3>
|
|
|
<table class="ForrestTable" cellspacing="1" cellpadding="4">
|
|
|
|
|
@@ -3577,7 +3762,7 @@ document.write("Last Published: " + document.lastModified);
|
|
|
</tr>
|
|
|
|
|
|
</table>
|
|
|
-<a name="N115A8"></a><a name="Sample+Runs"></a>
|
|
|
+<a name="N11692"></a><a name="Sample+Runs"></a>
|
|
|
<h3 class="h4">Sample Runs</h3>
|
|
|
<p>Sample text-files as input:</p>
|
|
|
<p>
|
|
@@ -3745,7 +3930,7 @@ document.write("Last Published: " + document.lastModified);
|
|
|
<br>
|
|
|
|
|
|
</p>
|
|
|
-<a name="N1167C"></a><a name="Highlights"></a>
|
|
|
+<a name="N11766"></a><a name="Highlights"></a>
|
|
|
<h3 class="h4">Highlights</h3>
|
|
|
<p>The second version of <span class="codefrag">WordCount</span> improves upon the
|
|
|
previous one by using some features offered by the Map/Reduce framework:
|