|
@@ -0,0 +1,587 @@
|
|
|
+<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
|
|
|
+<html>
|
|
|
+<head>
|
|
|
+<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
|
|
|
+<meta content="Apache Forrest" name="Generator">
|
|
|
+<meta name="Forrest-version" content="0.8">
|
|
|
+<meta name="Forrest-skin-name" content="pelt">
|
|
|
+<title>DistCp</title>
|
|
|
+<link type="text/css" href="skin/basic.css" rel="stylesheet">
|
|
|
+<link media="screen" type="text/css" href="skin/screen.css" rel="stylesheet">
|
|
|
+<link media="print" type="text/css" href="skin/print.css" rel="stylesheet">
|
|
|
+<link type="text/css" href="skin/profile.css" rel="stylesheet">
|
|
|
+<script src="skin/getBlank.js" language="javascript" type="text/javascript"></script><script src="skin/getMenu.js" language="javascript" type="text/javascript"></script><script src="skin/fontsize.js" language="javascript" type="text/javascript"></script>
|
|
|
+<link rel="shortcut icon" href="images/favicon.ico">
|
|
|
+</head>
|
|
|
+<body onload="init()">
|
|
|
+<script type="text/javascript">ndeSetTextSize();</script>
|
|
|
+<div id="top">
|
|
|
+<!--+
|
|
|
+ |breadtrail
|
|
|
+ +-->
|
|
|
+<div class="breadtrail">
|
|
|
+<a href="http://www.apache.org/">Apache</a> > <a href="http://hadoop.apache.org/">Hadoop</a> > <a href="http://hadoop.apache.org/core/">Core</a><script src="skin/breadcrumbs.js" language="JavaScript" type="text/javascript"></script>
|
|
|
+</div>
|
|
|
+<!--+
|
|
|
+ |header
|
|
|
+ +-->
|
|
|
+<div class="header">
|
|
|
+<!--+
|
|
|
+ |start group logo
|
|
|
+ +-->
|
|
|
+<div class="grouplogo">
|
|
|
+<a href="http://hadoop.apache.org/"><img class="logoImage" alt="Hadoop" src="images/hadoop-logo.jpg" title="Apache Hadoop"></a>
|
|
|
+</div>
|
|
|
+<!--+
|
|
|
+ |end group logo
|
|
|
+ +-->
|
|
|
+<!--+
|
|
|
+ |start Project Logo
|
|
|
+ +-->
|
|
|
+<div class="projectlogo">
|
|
|
+<a href="http://hadoop.apache.org/core/"><img class="logoImage" alt="Hadoop" src="images/core-logo.gif" title="Scalable Computing Platform"></a>
|
|
|
+</div>
|
|
|
+<!--+
|
|
|
+ |end Project Logo
|
|
|
+ +-->
|
|
|
+<!--+
|
|
|
+ |start Search
|
|
|
+ +-->
|
|
|
+<div class="searchbox">
|
|
|
+<form action="http://www.google.com/search" method="get" class="roundtopsmall">
|
|
|
+<input value="hadoop.apache.org" name="sitesearch" type="hidden"><input onFocus="getBlank (this, 'Search the site with google');" size="25" name="q" id="query" type="text" value="Search the site with google">
|
|
|
+ <input name="Search" value="Search" type="submit">
|
|
|
+</form>
|
|
|
+</div>
|
|
|
+<!--+
|
|
|
+ |end search
|
|
|
+ +-->
|
|
|
+<!--+
|
|
|
+ |start Tabs
|
|
|
+ +-->
|
|
|
+<ul id="tabs">
|
|
|
+<li>
|
|
|
+<a class="unselected" href="http://hadoop.apache.org/core/">Project</a>
|
|
|
+</li>
|
|
|
+<li>
|
|
|
+<a class="unselected" href="http://wiki.apache.org/hadoop">Wiki</a>
|
|
|
+</li>
|
|
|
+<li class="current">
|
|
|
+<a class="selected" href="index.html">Hadoop 0.18 Documentation</a>
|
|
|
+</li>
|
|
|
+</ul>
|
|
|
+<!--+
|
|
|
+ |end Tabs
|
|
|
+ +-->
|
|
|
+</div>
|
|
|
+</div>
|
|
|
+<div id="main">
|
|
|
+<div id="publishedStrip">
|
|
|
+<!--+
|
|
|
+ |start Subtabs
|
|
|
+ +-->
|
|
|
+<div id="level2tabs"></div>
|
|
|
+<!--+
|
|
|
+ |end Endtabs
|
|
|
+ +-->
|
|
|
+<script type="text/javascript"><!--
|
|
|
+document.write("Last Published: " + document.lastModified);
|
|
|
+// --></script>
|
|
|
+</div>
|
|
|
+<!--+
|
|
|
+ |breadtrail
|
|
|
+ +-->
|
|
|
+<div class="breadtrail">
|
|
|
+
|
|
|
+
|
|
|
+ </div>
|
|
|
+<!--+
|
|
|
+ |start Menu, mainarea
|
|
|
+ +-->
|
|
|
+<!--+
|
|
|
+ |start Menu
|
|
|
+ +-->
|
|
|
+<div id="menu">
|
|
|
+<div onclick="SwitchMenu('menu_selected_1.1', 'skin/')" id="menu_selected_1.1Title" class="menutitle" style="background-image: url('skin/images/chapter_open.gif');">Documentation</div>
|
|
|
+<div id="menu_selected_1.1" class="selectedmenuitemgroup" style="display: block;">
|
|
|
+<div class="menuitem">
|
|
|
+<a href="index.html">Overview</a>
|
|
|
+</div>
|
|
|
+<div class="menuitem">
|
|
|
+<a href="quickstart.html">Quickstart</a>
|
|
|
+</div>
|
|
|
+<div class="menuitem">
|
|
|
+<a href="cluster_setup.html">Cluster Setup</a>
|
|
|
+</div>
|
|
|
+<div class="menuitem">
|
|
|
+<a href="hdfs_design.html">HDFS Architecture</a>
|
|
|
+</div>
|
|
|
+<div class="menuitem">
|
|
|
+<a href="hdfs_user_guide.html">HDFS User Guide</a>
|
|
|
+</div>
|
|
|
+<div class="menuitem">
|
|
|
+<a href="hdfs_permissions_guide.html">HDFS Permissions Guide</a>
|
|
|
+</div>
|
|
|
+<div class="menuitem">
|
|
|
+<a href="hdfs_quota_admin_guide.html">HDFS Quotas Administrator Guide</a>
|
|
|
+</div>
|
|
|
+<div class="menuitem">
|
|
|
+<a href="hdfs_shell.html">FS Shell Guide</a>
|
|
|
+</div>
|
|
|
+<div class="menupage">
|
|
|
+<div class="menupagetitle">DistCp Guide</div>
|
|
|
+</div>
|
|
|
+<div class="menuitem">
|
|
|
+<a href="mapred_tutorial.html">Map-Reduce Tutorial</a>
|
|
|
+</div>
|
|
|
+<div class="menuitem">
|
|
|
+<a href="native_libraries.html">Native Hadoop Libraries</a>
|
|
|
+</div>
|
|
|
+<div class="menuitem">
|
|
|
+<a href="streaming.html">Streaming</a>
|
|
|
+</div>
|
|
|
+<div class="menuitem">
|
|
|
+<a href="hadoop_archives.html">Hadoop Archives</a>
|
|
|
+</div>
|
|
|
+<div class="menuitem">
|
|
|
+<a href="hod.html">Hadoop On Demand</a>
|
|
|
+</div>
|
|
|
+<div class="menuitem">
|
|
|
+<a href="api/index.html">API Docs</a>
|
|
|
+</div>
|
|
|
+<div class="menuitem">
|
|
|
+<a href="http://wiki.apache.org/hadoop/">Wiki</a>
|
|
|
+</div>
|
|
|
+<div class="menuitem">
|
|
|
+<a href="http://wiki.apache.org/hadoop/FAQ">FAQ</a>
|
|
|
+</div>
|
|
|
+<div class="menuitem">
|
|
|
+<a href="http://hadoop.apache.org/core/mailing_lists.html">Mailing Lists</a>
|
|
|
+</div>
|
|
|
+<div class="menuitem">
|
|
|
+<a href="releasenotes.html">Release Notes</a>
|
|
|
+</div>
|
|
|
+<div class="menuitem">
|
|
|
+<a href="changes.html">All Changes</a>
|
|
|
+</div>
|
|
|
+</div>
|
|
|
+<div id="credit"></div>
|
|
|
+<div id="roundbottom">
|
|
|
+<img style="display: none" class="corner" height="15" width="15" alt="" src="skin/images/rc-b-l-15-1body-2menu-3menu.png"></div>
|
|
|
+<!--+
|
|
|
+ |alternative credits
|
|
|
+ +-->
|
|
|
+<div id="credit2"></div>
|
|
|
+</div>
|
|
|
+<!--+
|
|
|
+ |end Menu
|
|
|
+ +-->
|
|
|
+<!--+
|
|
|
+ |start content
|
|
|
+ +-->
|
|
|
+<div id="content">
|
|
|
+<div title="Portable Document Format" class="pdflink">
|
|
|
+<a class="dida" href="distcp.pdf"><img alt="PDF -icon" src="skin/images/pdfdoc.gif" class="skin"><br>
|
|
|
+ PDF</a>
|
|
|
+</div>
|
|
|
+<h1>DistCp</h1>
|
|
|
+<div id="minitoc-area">
|
|
|
+<ul class="minitoc">
|
|
|
+<li>
|
|
|
+<a href="#Overview">Overview</a>
|
|
|
+</li>
|
|
|
+<li>
|
|
|
+<a href="#Usage">Usage</a>
|
|
|
+<ul class="minitoc">
|
|
|
+<li>
|
|
|
+<a href="#Basic">Basic</a>
|
|
|
+</li>
|
|
|
+<li>
|
|
|
+<a href="#options">Options</a>
|
|
|
+<ul class="minitoc">
|
|
|
+<li>
|
|
|
+<a href="#Option+Index">Option Index</a>
|
|
|
+</li>
|
|
|
+<li>
|
|
|
+<a href="#uo">Update and Overwrite</a>
|
|
|
+</li>
|
|
|
+</ul>
|
|
|
+</li>
|
|
|
+</ul>
|
|
|
+</li>
|
|
|
+<li>
|
|
|
+<a href="#etc">Appendix</a>
|
|
|
+<ul class="minitoc">
|
|
|
+<li>
|
|
|
+<a href="#Map+sizing">Map sizing</a>
|
|
|
+</li>
|
|
|
+<li>
|
|
|
+<a href="#cpver">Copying between versions of HDFS</a>
|
|
|
+</li>
|
|
|
+<li>
|
|
|
+<a href="#Map%2FReduce+and+other+side-effects">Map/Reduce and other side-effects</a>
|
|
|
+</li>
|
|
|
+</ul>
|
|
|
+</li>
|
|
|
+</ul>
|
|
|
+</div>
|
|
|
+
|
|
|
+
|
|
|
+<a name="N1000D"></a><a name="Overview"></a>
|
|
|
+<h2 class="h3">Overview</h2>
|
|
|
+<div class="section">
|
|
|
+<p>DistCp (distributed copy) is a tool used for large inter/intra-cluster
|
|
|
+ copying. It uses map/reduce to effect its distribution, error
|
|
|
+ handling/recovery, and reporting. It expands a list of files and
|
|
|
+ directories into input to map tasks, each of which will copy a partition
|
|
|
+ of the files specified in the source list. Its map/reduce pedigree has
|
|
|
+ endowed it with some quirks in both its semantics and execution. The
|
|
|
+ purpose of this document is to offer guidance for common tasks and to
|
|
|
+ elucidate its model.</p>
|
|
|
+</div>
|
|
|
+
|
|
|
+
|
|
|
+<a name="N10017"></a><a name="Usage"></a>
|
|
|
+<h2 class="h3">Usage</h2>
|
|
|
+<div class="section">
|
|
|
+<a name="N1001D"></a><a name="Basic"></a>
|
|
|
+<h3 class="h4">Basic</h3>
|
|
|
+<p>The most common invocation of DistCp is an inter-cluster copy:</p>
|
|
|
+<p>
|
|
|
+<span class="codefrag">bash$ hadoop distcp hdfs://nn1:8020/foo/bar \</span>
|
|
|
+<br>
|
|
|
+
|
|
|
+<span class="codefrag">
|
|
|
+
|
|
|
+ hdfs://nn2:8020/bar/foo</span>
|
|
|
+</p>
|
|
|
+<p>This will expand the namespace under <span class="codefrag">/foo/bar</span> on nn1
|
|
|
+ into a temporary file, partition its contents among a set of map
|
|
|
+ tasks, and start a copy on each TaskTracker from nn1 to nn2. Note
|
|
|
+ that DistCp expects absolute paths.</p>
|
|
|
+<p>One can also specify multiple source directories on the command
|
|
|
+ line:</p>
|
|
|
+<p>
|
|
|
+<span class="codefrag">bash$ hadoop distcp hdfs://nn1:8020/foo/a \</span>
|
|
|
+<br>
|
|
|
+
|
|
|
+<span class="codefrag">
|
|
|
+
|
|
|
+ hdfs://nn1:8020/foo/b \</span>
|
|
|
+<br>
|
|
|
+
|
|
|
+<span class="codefrag">
|
|
|
+
|
|
|
+ hdfs://nn2:8020/bar/foo</span>
|
|
|
+</p>
|
|
|
+<p>Or, equivalently, from a file using the <span class="codefrag">-f</span> option:<br>
|
|
|
+
|
|
|
+<span class="codefrag">bash$ hadoop distcp -f hdfs://nn1:8020/srclist \</span>
|
|
|
+<br>
|
|
|
+
|
|
|
+<span class="codefrag">
|
|
|
+
|
|
|
+ hdfs://nn2:8020/bar/foo</span>
|
|
|
+<br>
|
|
|
+</p>
|
|
|
+<p>Where <span class="codefrag">srclist</span> contains<br>
|
|
|
+
|
|
|
+<span class="codefrag"> hdfs://nn1:8020/foo/a</span>
|
|
|
+<br>
|
|
|
+
|
|
|
+<span class="codefrag"> hdfs://nn1:8020/foo/b</span>
|
|
|
+</p>
|
|
|
+<p>When copying from multiple sources, DistCp will abort the copy with
|
|
|
+ an error message if two sources collide, but collisions at the
|
|
|
+ destination are resolved per the <a href="#options">options</a>
|
|
|
+ specified. By default, files already existing at the destination are
|
|
|
+ skipped (i.e. not replaced by the source file). A count of skipped
|
|
|
+ files is reported at the end of each job, but it may be inaccurate if a
|
|
|
+ copier failed for some subset of its files, but succeeded on a later
|
|
|
+ attempt (see <a href="#etc">Appendix</a>).</p>
|
|
|
+<p>It is important that each TaskTracker can reach and communicate with
|
|
|
+ both the source and destination filesystems. For hdfs, both the source
|
|
|
+ and destination must be running the same version of the protocol or use
|
|
|
+ a backwards-compatible protocol (see <a href="#cpver">Copying Between
|
|
|
+ Versions</a>).</p>
|
|
|
+<p>After a copy, it is recommended that one generates and cross-checks
|
|
|
+ a listing of the source and destination to verify that the copy was
|
|
|
+ truly successful. Since DistCp employs both map/reduce and the
|
|
|
+ FileSystem API, issues in or between any of the three could adversely
|
|
|
+ and silently affect the copy. Some have had success running with
|
|
|
+ <span class="codefrag">-update</span> enabled to perform a second pass, but users should
|
|
|
+ be acquainted with its semantics before attempting this.</p>
|
|
|
+<p>It's also worth noting that if another client is still writing to a
|
|
|
+ source file, the copy will likely fail. Attempting to overwrite a file
|
|
|
+ being written at the destination should also fail on HDFS. If a source
|
|
|
+ file is (re)moved before it is copied, the copy will fail with a
|
|
|
+ FileNotFoundException.</p>
|
|
|
+<a name="N1007E"></a><a name="options"></a>
|
|
|
+<h3 class="h4">Options</h3>
|
|
|
+<a name="N10084"></a><a name="Option+Index"></a>
|
|
|
+<h4>Option Index</h4>
|
|
|
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
|
|
|
+
|
|
|
+<tr>
|
|
|
+<th colspan="1" rowspan="1"> Flag </th><th colspan="1" rowspan="1"> Description </th><th colspan="1" rowspan="1"> Notes </th>
|
|
|
+</tr>
|
|
|
+
|
|
|
+
|
|
|
+<tr>
|
|
|
+<td colspan="1" rowspan="1"><span class="codefrag">-p[rbugp]</span></td>
|
|
|
+ <td colspan="1" rowspan="1">Preserve<br>
|
|
|
+ r: replication number<br>
|
|
|
+ b: block size<br>
|
|
|
+ u: user<br>
|
|
|
+ g: group<br>
|
|
|
+ p: permission<br>
|
|
|
+</td>
|
|
|
+ <td colspan="1" rowspan="1">Modification times are not preserved. Also, when
|
|
|
+ <span class="codefrag">-update</span> is specified, status updates will
|
|
|
+ <strong>not</strong> be synchronized unless the file sizes
|
|
|
+ also differ (i.e. unless the file is re-created).
|
|
|
+ </td>
|
|
|
+</tr>
|
|
|
+
|
|
|
+<tr>
|
|
|
+<td colspan="1" rowspan="1"><span class="codefrag">-i</span></td>
|
|
|
+ <td colspan="1" rowspan="1">Ignore failures</td>
|
|
|
+ <td colspan="1" rowspan="1">As explained in the <a href="#etc">Appendix</a>, this option
|
|
|
+ will keep more accurate statistics about the copy than the
|
|
|
+ default case. It also preserves logs from failed copies, which
|
|
|
+ can be valuable for debugging. Finally, a failing map will not
|
|
|
+ cause the job to fail before all splits are attempted.
|
|
|
+ </td>
|
|
|
+</tr>
|
|
|
+
|
|
|
+<tr>
|
|
|
+<td colspan="1" rowspan="1"><span class="codefrag">-log <logdir></span></td>
|
|
|
+ <td colspan="1" rowspan="1">Write logs to <logdir></td>
|
|
|
+ <td colspan="1" rowspan="1">DistCp keeps logs of each file it attempts to copy as map
|
|
|
+ output. If a map fails, the log output will not be retained if
|
|
|
+ it is re-executed.
|
|
|
+ </td>
|
|
|
+</tr>
|
|
|
+
|
|
|
+<tr>
|
|
|
+<td colspan="1" rowspan="1"><span class="codefrag">-m <num_maps></span></td>
|
|
|
+ <td colspan="1" rowspan="1">Maximum number of simultaneous copies</td>
|
|
|
+ <td colspan="1" rowspan="1">Specify the number of maps to copy data. Note that more maps
|
|
|
+ may not necessarily improve throughput.
|
|
|
+ </td>
|
|
|
+</tr>
|
|
|
+
|
|
|
+<tr>
|
|
|
+<td colspan="1" rowspan="1"><span class="codefrag">-overwrite</span></td>
|
|
|
+ <td colspan="1" rowspan="1">Overwrite destination</td>
|
|
|
+ <td colspan="1" rowspan="1">If a map fails and <span class="codefrag">-i</span> is not specified, all the
|
|
|
+ files in the split, not only those that failed, will be recopied.
|
|
|
+ As discussed in the <a href="#uo">following</a>, it also changes
|
|
|
+ the semantics for generating destination paths, so users should
|
|
|
+ use this carefully.
|
|
|
+ </td>
|
|
|
+</tr>
|
|
|
+
|
|
|
+<tr>
|
|
|
+<td colspan="1" rowspan="1"><span class="codefrag">-update</span></td>
|
|
|
+ <td colspan="1" rowspan="1">Overwrite if src size different from dst size</td>
|
|
|
+ <td colspan="1" rowspan="1">As noted in the preceding, this is not a "sync"
|
|
|
+ operation. The only criterion examined is the source and
|
|
|
+ destination file sizes; if they differ, the source file
|
|
|
+ replaces the destination file. As discussed in the
|
|
|
+ <a href="#uo">following</a>, it also changes the semantics for
|
|
|
+ generating destination paths, so users should use this carefully.
|
|
|
+ </td>
|
|
|
+</tr>
|
|
|
+
|
|
|
+<tr>
|
|
|
+<td colspan="1" rowspan="1"><span class="codefrag">-f <urilist_uri></span></td>
|
|
|
+ <td colspan="1" rowspan="1">Use list at <urilist_uri> as src list</td>
|
|
|
+ <td colspan="1" rowspan="1">This is equivalent to listing each source on the command
|
|
|
+ line. The <span class="codefrag">urilist_uri</span> list should be a fully
|
|
|
+ qualified URI.
|
|
|
+ </td>
|
|
|
+</tr>
|
|
|
+
|
|
|
+
|
|
|
+</table>
|
|
|
+<a name="N10136"></a><a name="uo"></a>
|
|
|
+<h4>Update and Overwrite</h4>
|
|
|
+<p>It's worth giving some examples of <span class="codefrag">-update</span> and
|
|
|
+ <span class="codefrag">-overwrite</span>. Consider a copy from <span class="codefrag">/foo/a</span> and
|
|
|
+ <span class="codefrag">/foo/b</span> to <span class="codefrag">/bar/foo</span>, where the sources contain
|
|
|
+ the following:</p>
|
|
|
+<p>
|
|
|
+<span class="codefrag"> hdfs://nn1:8020/foo/a</span>
|
|
|
+<br>
|
|
|
+
|
|
|
+<span class="codefrag"> hdfs://nn1:8020/foo/a/aa</span>
|
|
|
+<br>
|
|
|
+
|
|
|
+<span class="codefrag"> hdfs://nn1:8020/foo/a/ab</span>
|
|
|
+<br>
|
|
|
+
|
|
|
+<span class="codefrag"> hdfs://nn1:8020/foo/b</span>
|
|
|
+<br>
|
|
|
+
|
|
|
+<span class="codefrag"> hdfs://nn1:8020/foo/b/ba</span>
|
|
|
+<br>
|
|
|
+
|
|
|
+<span class="codefrag"> hdfs://nn1:8020/foo/b/ab</span>
|
|
|
+</p>
|
|
|
+<p>If either <span class="codefrag">-update</span> or <span class="codefrag">-overwrite</span> is set,
|
|
|
+ then both sources will map an entry to <span class="codefrag">/bar/foo/ab</span> at the
|
|
|
+ destination. For both options, the contents of each source directory
|
|
|
+ are compared with the <strong>contents</strong> of the destination
|
|
|
+ directory. Rather than permit this conflict, DistCp will abort.</p>
|
|
|
+<p>In the default case, both <span class="codefrag">/bar/foo/a</span> and
|
|
|
+ <span class="codefrag">/bar/foo/b</span> will be created and neither will collide.</p>
|
|
|
+<p>Now consider a legal copy using <span class="codefrag">-update</span>:<br>
|
|
|
+
|
|
|
+<span class="codefrag">distcp -update hdfs://nn1:8020/foo/a \</span>
|
|
|
+<br>
|
|
|
+
|
|
|
+<span class="codefrag">
|
|
|
+
|
|
|
+ hdfs://nn1:8020/foo/b \</span>
|
|
|
+<br>
|
|
|
+
|
|
|
+<span class="codefrag">
|
|
|
+
|
|
|
+ hdfs://nn2:8020/bar</span>
|
|
|
+</p>
|
|
|
+<p>With sources/sizes:</p>
|
|
|
+<p>
|
|
|
+<span class="codefrag"> hdfs://nn1:8020/foo/a</span>
|
|
|
+<br>
|
|
|
+
|
|
|
+<span class="codefrag"> hdfs://nn1:8020/foo/a/aa 32</span>
|
|
|
+<br>
|
|
|
+
|
|
|
+<span class="codefrag"> hdfs://nn1:8020/foo/a/ab 32</span>
|
|
|
+<br>
|
|
|
+
|
|
|
+<span class="codefrag"> hdfs://nn1:8020/foo/b</span>
|
|
|
+<br>
|
|
|
+
|
|
|
+<span class="codefrag"> hdfs://nn1:8020/foo/b/ba 64</span>
|
|
|
+<br>
|
|
|
+
|
|
|
+<span class="codefrag"> hdfs://nn1:8020/foo/b/bb 32</span>
|
|
|
+</p>
|
|
|
+<p>And destination/sizes:</p>
|
|
|
+<p>
|
|
|
+<span class="codefrag"> hdfs://nn2:8020/bar</span>
|
|
|
+<br>
|
|
|
+
|
|
|
+<span class="codefrag"> hdfs://nn2:8020/bar/aa 32</span>
|
|
|
+<br>
|
|
|
+
|
|
|
+<span class="codefrag"> hdfs://nn2:8020/bar/ba 32</span>
|
|
|
+<br>
|
|
|
+
|
|
|
+<span class="codefrag"> hdfs://nn2:8020/bar/bb 64</span>
|
|
|
+</p>
|
|
|
+<p>Will effect:</p>
|
|
|
+<p>
|
|
|
+<span class="codefrag"> hdfs://nn2:8020/bar</span>
|
|
|
+<br>
|
|
|
+
|
|
|
+<span class="codefrag"> hdfs://nn2:8020/bar/aa 32</span>
|
|
|
+<br>
|
|
|
+
|
|
|
+<span class="codefrag"> hdfs://nn2:8020/bar/ab 32</span>
|
|
|
+<br>
|
|
|
+
|
|
|
+<span class="codefrag"> hdfs://nn2:8020/bar/ba 64</span>
|
|
|
+<br>
|
|
|
+
|
|
|
+<span class="codefrag"> hdfs://nn2:8020/bar/bb 32</span>
|
|
|
+</p>
|
|
|
+<p>Only <span class="codefrag">aa</span> is not overwritten on nn2. If
|
|
|
+ <span class="codefrag">-overwrite</span> were specified, all elements would be
|
|
|
+ overwritten.</p>
|
|
|
+</div> <!-- Usage -->
|
|
|
+
|
|
|
+
|
|
|
+<a name="N101E7"></a><a name="etc"></a>
|
|
|
+<h2 class="h3">Appendix</h2>
|
|
|
+<div class="section">
|
|
|
+<a name="N101ED"></a><a name="Map+sizing"></a>
|
|
|
+<h3 class="h4">Map sizing</h3>
|
|
|
+<p>DistCp makes a faint attempt to size each map comparably so that
|
|
|
+ each copies roughly the same number of bytes. Note that files are the
|
|
|
+ finest level of granularity, so increasing the number of simultaneous
|
|
|
+ copiers (i.e. maps) may not always increase the number of
|
|
|
+ simultaneous copies nor the overall throughput.</p>
|
|
|
+<p>If <span class="codefrag">-m</span> is not specified, DistCp will attempt to
|
|
|
+ schedule work for <span class="codefrag">min(total_bytes / bytes.per.map, 20 *
|
|
|
+ num_task_trackers)</span> where <span class="codefrag">bytes.per.map</span> defaults
|
|
|
+ to 256MB.</p>
|
|
|
+<p>Tuning the number of maps to the size of the source and
|
|
|
+ destination clusters, the size of the copy, and the available
|
|
|
+ bandwidth is recommended for long-running and regularly run jobs.</p>
|
|
|
+<a name="N10206"></a><a name="cpver"></a>
|
|
|
+<h3 class="h4">Copying between versions of HDFS</h3>
|
|
|
+<p>For copying between two different versions of Hadoop, one will
|
|
|
+ usually use HftpFileSystem. This is a read-only FileSystem, so DistCp
|
|
|
+ must be run on the destination cluster (more specifically, on
|
|
|
+ TaskTrackers that can write to the destination cluster). Each source is
|
|
|
+ specified as <span class="codefrag">hftp://<dfs.http.address>/<path></span>
|
|
|
+ (the default <span class="codefrag">dfs.http.address</span> is
|
|
|
+ <namenode>:50070).</p>
|
|
|
+<a name="N10216"></a><a name="Map%2FReduce+and+other+side-effects"></a>
|
|
|
+<h3 class="h4">Map/Reduce and other side-effects</h3>
|
|
|
+<p>As has been mentioned in the preceding, should a map fail to copy
|
|
|
+ one of its inputs, there will be several side-effects.</p>
|
|
|
+<ul>
|
|
|
+
|
|
|
+
|
|
|
+<li>Unless <span class="codefrag">-i</span> is specified, the logs generated by that
|
|
|
+ task attempt will be replaced by the previous attempt.</li>
|
|
|
+
|
|
|
+
|
|
|
+<li>Unless <span class="codefrag">-overwrite</span> is specified, files successfully
|
|
|
+ copied by a previous map on a re-execution will be marked as
|
|
|
+ "skipped".</li>
|
|
|
+
|
|
|
+
|
|
|
+<li>If a map fails <span class="codefrag">mapred.map.max.attempts</span> times, the
|
|
|
+ remaining map tasks will be killed (unless <span class="codefrag">-i</span> is
|
|
|
+ set).</li>
|
|
|
+
|
|
|
+
|
|
|
+<li>If <span class="codefrag">mapred.speculative.execution</span> is set set
|
|
|
+ <span class="codefrag">final</span> and <span class="codefrag">true</span>, the result of the copy is
|
|
|
+ undefined.</li>
|
|
|
+
|
|
|
+
|
|
|
+</ul>
|
|
|
+</div> <!-- Appendix -->
|
|
|
+
|
|
|
+
|
|
|
+</div>
|
|
|
+<!--+
|
|
|
+ |end content
|
|
|
+ +-->
|
|
|
+<div class="clearboth"> </div>
|
|
|
+</div>
|
|
|
+<div id="footer">
|
|
|
+<!--+
|
|
|
+ |start bottomstrip
|
|
|
+ +-->
|
|
|
+<div class="lastmodified">
|
|
|
+<script type="text/javascript"><!--
|
|
|
+document.write("Last Published: " + document.lastModified);
|
|
|
+// --></script>
|
|
|
+</div>
|
|
|
+<div class="copyright">
|
|
|
+ Copyright ©
|
|
|
+ 2007 <a href="http://www.apache.org/licenses/">The Apache Software Foundation.</a>
|
|
|
+</div>
|
|
|
+<!--+
|
|
|
+ |end bottomstrip
|
|
|
+ +-->
|
|
|
+</div>
|
|
|
+</body>
|
|
|
+</html>
|