  1. <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
  2. <html>
  3. <head>
  4. <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
  5. <meta content="Apache Forrest" name="Generator">
  6. <meta name="Forrest-version" content="0.8">
  7. <meta name="Forrest-skin-name" content="pelt">
  8. <title>Hadoop DistCp Guide</title>
  9. <link type="text/css" href="skin/basic.css" rel="stylesheet">
  10. <link media="screen" type="text/css" href="skin/screen.css" rel="stylesheet">
  11. <link media="print" type="text/css" href="skin/print.css" rel="stylesheet">
  12. <link type="text/css" href="skin/profile.css" rel="stylesheet">
  13. <script src="skin/getBlank.js" language="javascript" type="text/javascript"></script><script src="skin/getMenu.js" language="javascript" type="text/javascript"></script><script src="skin/fontsize.js" language="javascript" type="text/javascript"></script>
  14. <link rel="shortcut icon" href="images/favicon.ico">
  15. </head>
  16. <body onload="init()">
  17. <script type="text/javascript">ndeSetTextSize();</script>
  18. <div id="top">
  19. <!--+
  20. |breadtrail
  21. +-->
  22. <div class="breadtrail">
  23. <a href="http://www.apache.org/">Apache</a> &gt; <a href="http://hadoop.apache.org/">Hadoop</a> &gt; <a href="http://hadoop.apache.org/core/">Core</a><script src="skin/breadcrumbs.js" language="JavaScript" type="text/javascript"></script>
  24. </div>
  25. <!--+
  26. |header
  27. +-->
  28. <div class="header">
  29. <!--+
  30. |start group logo
  31. +-->
  32. <div class="grouplogo">
  33. <a href="http://hadoop.apache.org/"><img class="logoImage" alt="Hadoop" src="images/hadoop-logo.jpg" title="Apache Hadoop"></a>
  34. </div>
  35. <!--+
  36. |end group logo
  37. +-->
  38. <!--+
  39. |start Project Logo
  40. +-->
  41. <div class="projectlogo">
  42. <a href="http://hadoop.apache.org/core/"><img class="logoImage" alt="Hadoop" src="images/core-logo.gif" title="Scalable Computing Platform"></a>
  43. </div>
  44. <!--+
  45. |end Project Logo
  46. +-->
  47. <!--+
  48. |start Search
  49. +-->
  50. <div class="searchbox">
  51. <form action="http://www.google.com/search" method="get" class="roundtopsmall">
  52. <input value="hadoop.apache.org" name="sitesearch" type="hidden"><input onFocus="getBlank (this, 'Search the site with google');" size="25" name="q" id="query" type="text" value="Search the site with google">&nbsp;
  53. <input name="Search" value="Search" type="submit">
  54. </form>
  55. </div>
  56. <!--+
  57. |end search
  58. +-->
  59. <!--+
  60. |start Tabs
  61. +-->
  62. <ul id="tabs">
  63. <li>
  64. <a class="unselected" href="http://hadoop.apache.org/core/">Project</a>
  65. </li>
  66. <li>
  67. <a class="unselected" href="http://wiki.apache.org/hadoop">Wiki</a>
  68. </li>
  69. <li class="current">
  70. <a class="selected" href="index.html">Hadoop 0.20 Documentation</a>
  71. </li>
  72. </ul>
  73. <!--+
  74. |end Tabs
  75. +-->
  76. </div>
  77. </div>
  78. <div id="main">
  79. <div id="publishedStrip">
  80. <!--+
  81. |start Subtabs
  82. +-->
  83. <div id="level2tabs"></div>
  84. <!--+
  85. |end Endtabs
  86. +-->
  87. <script type="text/javascript"><!--
  88. document.write("Last Published: " + document.lastModified);
  89. // --></script>
  90. </div>
  91. <!--+
  92. |breadtrail
  93. +-->
  94. <div class="breadtrail">
  95. &nbsp;
  96. </div>
  97. <!--+
  98. |start Menu, mainarea
  99. +-->
  100. <!--+
  101. |start Menu
  102. +-->
  103. <div id="menu">
  104. <div onclick="SwitchMenu('menu_selected_1.1', 'skin/')" id="menu_selected_1.1Title" class="menutitle" style="background-image: url('skin/images/chapter_open.gif');">Documentation</div>
  105. <div id="menu_selected_1.1" class="selectedmenuitemgroup" style="display: block;">
  106. <div class="menuitem">
  107. <a href="index.html">Overview</a>
  108. </div>
  109. <div class="menuitem">
  110. <a href="quickstart.html">Hadoop Quick Start</a>
  111. </div>
  112. <div class="menuitem">
  113. <a href="cluster_setup.html">Hadoop Cluster Setup</a>
  114. </div>
  115. <div class="menuitem">
  116. <a href="mapred_tutorial.html">Hadoop Map/Reduce Tutorial</a>
  117. </div>
  118. <div class="menuitem">
  119. <a href="commands_manual.html">Hadoop Command Guide</a>
  120. </div>
  121. <div class="menuitem">
  122. <a href="hdfs_shell.html">Hadoop FS Shell Guide</a>
  123. </div>
  124. <div class="menupage">
  125. <div class="menupagetitle">Hadoop DistCp Guide</div>
  126. </div>
  127. <div class="menuitem">
  128. <a href="native_libraries.html">Hadoop Native Libraries</a>
  129. </div>
  130. <div class="menuitem">
  131. <a href="streaming.html">Hadoop Streaming</a>
  132. </div>
  133. <div class="menuitem">
  134. <a href="hadoop_archives.html">Hadoop Archives</a>
  135. </div>
  136. <div class="menuitem">
  137. <a href="hdfs_user_guide.html">HDFS User Guide</a>
  138. </div>
  139. <div class="menuitem">
  140. <a href="hdfs_design.html">HDFS Architecture</a>
  141. </div>
  142. <div class="menuitem">
  143. <a href="hdfs_permissions_guide.html">HDFS Admin Guide: Permissions</a>
  144. </div>
  145. <div class="menuitem">
  146. <a href="hdfs_quota_admin_guide.html">HDFS Admin Guide: Quotas</a>
  147. </div>
  148. <div class="menuitem">
  149. <a href="SLG_user_guide.html">HDFS Utilities</a>
  150. </div>
  151. <div class="menuitem">
  152. <a href="libhdfs.html">HDFS C API</a>
  153. </div>
  154. <div class="menuitem">
  155. <a href="hod_user_guide.html">HOD User Guide</a>
  156. </div>
  157. <div class="menuitem">
  158. <a href="hod_admin_guide.html">HOD Admin Guide</a>
  159. </div>
  160. <div class="menuitem">
  161. <a href="hod_config_guide.html">HOD Config Guide</a>
  162. </div>
  163. <div class="menuitem">
  164. <a href="capacity_scheduler.html">Capacity Scheduler</a>
  165. </div>
  166. <div class="menuitem">
  167. <a href="vaidya.html">Hadoop Vaidya</a>
  168. </div>
  169. <div class="menuitem">
  170. <a href="api/index.html">API Docs</a>
  171. </div>
  172. <div class="menuitem">
  173. <a href="jdiff/changes.html">API Changes</a>
  174. </div>
  175. <div class="menuitem">
  176. <a href="http://wiki.apache.org/hadoop/">Wiki</a>
  177. </div>
  178. <div class="menuitem">
  179. <a href="http://wiki.apache.org/hadoop/FAQ">FAQ</a>
  180. </div>
  181. <div class="menuitem">
  182. <a href="releasenotes.html">Release Notes</a>
  183. </div>
  184. <div class="menuitem">
  185. <a href="changes.html">Change Log</a>
  186. </div>
  187. </div>
  188. <div id="credit"></div>
  189. <div id="roundbottom">
  190. <img style="display: none" class="corner" height="15" width="15" alt="" src="skin/images/rc-b-l-15-1body-2menu-3menu.png"></div>
  191. <!--+
  192. |alternative credits
  193. +-->
  194. <div id="credit2"></div>
  195. </div>
  196. <!--+
  197. |end Menu
  198. +-->
  199. <!--+
  200. |start content
  201. +-->
  202. <div id="content">
  203. <div title="Portable Document Format" class="pdflink">
  204. <a class="dida" href="distcp.pdf"><img alt="PDF -icon" src="skin/images/pdfdoc.gif" class="skin"><br>
  205. PDF</a>
  206. </div>
  207. <h1>Hadoop DistCp Guide</h1>
  208. <div id="minitoc-area">
  209. <ul class="minitoc">
  210. <li>
  211. <a href="#Overview">Overview</a>
  212. </li>
  213. <li>
  214. <a href="#Usage">Usage</a>
  215. <ul class="minitoc">
  216. <li>
  217. <a href="#Basic">Basic</a>
  218. </li>
  219. <li>
  220. <a href="#options">Options</a>
  221. <ul class="minitoc">
  222. <li>
  223. <a href="#Option+Index">Option Index</a>
  224. </li>
  225. <li>
  226. <a href="#Symbolic-Representations">Symbolic Representations</a>
  227. </li>
  228. <li>
  229. <a href="#uo">Update and Overwrite</a>
  230. </li>
  231. </ul>
  232. </li>
  233. </ul>
  234. </li>
  235. <li>
  236. <a href="#etc">Appendix</a>
  237. <ul class="minitoc">
  238. <li>
  239. <a href="#Map+sizing">Map sizing</a>
  240. </li>
  241. <li>
  242. <a href="#cpver">Copying between versions of HDFS</a>
  243. </li>
  244. <li>
  245. <a href="#Map%2FReduce+and+other+side-effects">Map/Reduce and other side-effects</a>
  246. </li>
  247. </ul>
  248. </li>
  249. </ul>
  250. </div>
  251. <a name="N1000D"></a><a name="Overview"></a>
  252. <h2 class="h3">Overview</h2>
  253. <div class="section">
  254. <p>DistCp (distributed copy) is a tool used for large inter/intra-cluster
  255. copying. It uses Map/Reduce to effect its distribution, error
  256. handling and recovery, and reporting. It expands a list of files and
  257. directories into input to map tasks, each of which will copy a partition
  258. of the files specified in the source list. Its Map/Reduce pedigree has
  259. endowed it with some quirks in both its semantics and execution. The
  260. purpose of this document is to offer guidance for common tasks and to
  261. elucidate its model.</p>
  262. </div>
  263. <a name="N10017"></a><a name="Usage"></a>
  264. <h2 class="h3">Usage</h2>
  265. <div class="section">
  266. <a name="N1001D"></a><a name="Basic"></a>
  267. <h3 class="h4">Basic</h3>
  268. <p>The most common invocation of DistCp is an inter-cluster copy:</p>
  269. <p>
  270. <span class="codefrag">bash$ hadoop distcp hdfs://nn1:8020/foo/bar \</span>
  271. <br>
  272. <span class="codefrag">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  273. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  274. hdfs://nn2:8020/bar/foo</span>
  275. </p>
  276. <p>This will expand the namespace under <span class="codefrag">/foo/bar</span> on nn1
  277. into a temporary file, partition its contents among a set of map
  278. tasks, and start a copy on each TaskTracker from nn1 to nn2. Note
  279. that DistCp expects absolute paths.</p>
  280. <p>One can also specify multiple source directories on the command
  281. line:</p>
  282. <p>
  283. <span class="codefrag">bash$ hadoop distcp hdfs://nn1:8020/foo/a \</span>
  284. <br>
  285. <span class="codefrag">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  286. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  287. hdfs://nn1:8020/foo/b \</span>
  288. <br>
  289. <span class="codefrag">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  290. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  291. hdfs://nn2:8020/bar/foo</span>
  292. </p>
  293. <p>Or, equivalently, from a file using the <span class="codefrag">-f</span> option:<br>
  294. <span class="codefrag">bash$ hadoop distcp -f hdfs://nn1:8020/srclist \</span>
  295. <br>
  296. <span class="codefrag">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  297. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  298. &nbsp;hdfs://nn2:8020/bar/foo</span>
  299. <br>
  300. </p>
  301. <p>Where <span class="codefrag">srclist</span> contains<br>
  302. <span class="codefrag">&nbsp;&nbsp;&nbsp;&nbsp;hdfs://nn1:8020/foo/a</span>
  303. <br>
  304. <span class="codefrag">&nbsp;&nbsp;&nbsp;&nbsp;hdfs://nn1:8020/foo/b</span>
  305. </p>
  306. <p>When copying from multiple sources, DistCp will abort the copy with
  307. an error message if two sources collide, but collisions at the
  308. destination are resolved per the <a href="#options">options</a>
  309. specified. By default, files already existing at the destination are
  310. skipped (i.e. not replaced by the source file). A count of skipped
  311. files is reported at the end of each job, but it may be inaccurate if a
  312. copier failed for some subset of its files, but succeeded on a later
  313. attempt (see <a href="#etc">Appendix</a>).</p>
  314. <p>It is important that each TaskTracker can reach and communicate with
  315. both the source and destination file systems. For HDFS, both the source
  316. and destination must be running the same version of the protocol or use
  317. a backwards-compatible protocol (see <a href="#cpver">Copying Between
  318. Versions</a>).</p>
  319. <p>After a copy, it is recommended that one generate and cross-check
  320. a listing of the source and destination to verify that the copy was
  321. truly successful. Since DistCp employs both Map/Reduce and the
  322. FileSystem API, issues in or between any of the three could adversely
  323. and silently affect the copy. Some have had success running with
  324. <span class="codefrag">-update</span> enabled to perform a second pass, but users should
  325. be acquainted with its semantics before attempting this.</p>
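<p>The cross-check described above can be sketched as a comparison of two sorted listings. This is a minimal sketch, not part of DistCp: the function name <span class="codefrag">compare_listings</span> and the input format (one <span class="codefrag">relative_path size</span> pair per line) are assumptions for illustration, and producing the listings (for example by post-processing <span class="codefrag">hadoop fs -lsr</span> output) is left to the reader.</p>

```shell
# Compare two listings of "relative_path size" lines, one line per file.
# Hypothetical helper: the listing format is an assumption, not DistCp output.
compare_listings() {
  src_sorted=$(mktemp)
  dst_sorted=$(mktemp)
  # Sort both listings so that ordering differences are not reported.
  sort "$1" > "$src_sorted"
  sort "$2" > "$dst_sorted"
  # diff exits 0 when the listings match and 1 when they differ.
  diff "$src_sorted" "$dst_sorted"
  status=$?
  rm -f "$src_sorted" "$dst_sorted"
  return "$status"
}
```

<p>A non-zero exit status means that at least one path or size differs; the diff output names the affected entries.</p>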
  326. <p>It's also worth noting that if another client is still writing to a
  327. source file, the copy will likely fail. Attempting to overwrite a file
  328. being written at the destination should also fail on HDFS. If a source
  329. file is (re)moved before it is copied, the copy will fail with a
  330. FileNotFoundException.</p>
  331. <a name="N1007E"></a><a name="options"></a>
  332. <h3 class="h4">Options</h3>
  333. <a name="N10084"></a><a name="Option+Index"></a>
  334. <h4>Option Index</h4>
  335. <table class="ForrestTable" cellspacing="1" cellpadding="4">
  336. <tr>
  337. <th colspan="1" rowspan="1"> Flag </th><th colspan="1" rowspan="1"> Description </th><th colspan="1" rowspan="1"> Notes </th>
  338. </tr>
  339. <tr>
  340. <td colspan="1" rowspan="1"><span class="codefrag">-p[rbugp]</span></td>
  341. <td colspan="1" rowspan="1">Preserve<br>
  342. &nbsp;&nbsp;r: replication number<br>
  343. &nbsp;&nbsp;b: block size<br>
  344. &nbsp;&nbsp;u: user<br>
  345. &nbsp;&nbsp;g: group<br>
  346. &nbsp;&nbsp;p: permission<br>
  347. </td>
  348. <td colspan="1" rowspan="1">Modification times are not preserved. Also, when
  349. <span class="codefrag">-update</span> is specified, status updates will
  350. <strong>not</strong> be synchronized unless the file sizes
  351. also differ (i.e. unless the file is re-created).
  352. </td>
  353. </tr>
  354. <tr>
  355. <td colspan="1" rowspan="1"><span class="codefrag">-i</span></td>
  356. <td colspan="1" rowspan="1">Ignore failures</td>
  357. <td colspan="1" rowspan="1">As explained in the <a href="#etc">Appendix</a>, this option
  358. will keep more accurate statistics about the copy than the
  359. default case. It also preserves logs from failed copies, which
  360. can be valuable for debugging. Finally, a failing map will not
  361. cause the job to fail before all splits are attempted.
  362. </td>
  363. </tr>
  364. <tr>
  365. <td colspan="1" rowspan="1"><span class="codefrag">-log &lt;logdir&gt;</span></td>
  366. <td colspan="1" rowspan="1">Write logs to &lt;logdir&gt;</td>
  367. <td colspan="1" rowspan="1">DistCp keeps logs of each file it attempts to copy as map
  368. output. If a map fails, the log output will not be retained if
  369. it is re-executed.
  370. </td>
  371. </tr>
  372. <tr>
  373. <td colspan="1" rowspan="1"><span class="codefrag">-m &lt;num_maps&gt;</span></td>
  374. <td colspan="1" rowspan="1">Maximum number of simultaneous copies</td>
  375. <td colspan="1" rowspan="1">Specify the number of maps to copy data. Note that more maps
  376. may not necessarily improve throughput.
  377. </td>
  378. </tr>
  379. <tr>
  380. <td colspan="1" rowspan="1"><span class="codefrag">-overwrite</span></td>
  381. <td colspan="1" rowspan="1">Overwrite destination</td>
  382. <td colspan="1" rowspan="1">If a map fails and <span class="codefrag">-i</span> is not specified, all the
  383. files in the split, not only those that failed, will be recopied.
  384. As discussed in <a href="#uo">Update and Overwrite</a>, it also changes
  385. the semantics for generating destination paths, so users should
  386. use this carefully.
  387. </td>
  388. </tr>
  389. <tr>
  390. <td colspan="1" rowspan="1"><span class="codefrag">-update</span></td>
  391. <td colspan="1" rowspan="1">Overwrite if src size different from dst size</td>
  392. <td colspan="1" rowspan="1">As noted above, this is not a "sync"
  393. operation. The only criterion examined is the source and
  394. destination file sizes; if they differ, the source file
  395. replaces the destination file. As discussed in
  396. <a href="#uo">Update and Overwrite</a>, it also changes the semantics for
  397. generating destination paths, so users should use this carefully.
  398. </td>
  399. </tr>
  400. <tr>
  401. <td colspan="1" rowspan="1"><span class="codefrag">-f &lt;urilist_uri&gt;</span></td>
  402. <td colspan="1" rowspan="1">Use list at &lt;urilist_uri&gt; as src list</td>
  403. <td colspan="1" rowspan="1">This is equivalent to listing each source on the command
  404. line. The <span class="codefrag">urilist_uri</span> list should be a fully
  405. qualified URI.
  406. </td>
  407. </tr>
  408. <tr>
  409. <td colspan="1" rowspan="1"><span class="codefrag">-filelimit &lt;n&gt;</span></td>
  410. <td colspan="1" rowspan="1">Limit the total number of files to be &lt;= n</td>
  411. <td colspan="1" rowspan="1">See also <a href="#Symbolic-Representations">Symbolic
  412. Representations</a>.
  413. </td>
  414. </tr>
  415. <tr>
  416. <td colspan="1" rowspan="1"><span class="codefrag">-sizelimit &lt;n&gt;</span></td>
  417. <td colspan="1" rowspan="1">Limit the total size to be &lt;= n bytes</td>
  418. <td colspan="1" rowspan="1">See also <a href="#Symbolic-Representations">Symbolic
  419. Representations</a>.
  420. </td>
  421. </tr>
  422. <tr>
  423. <td colspan="1" rowspan="1"><span class="codefrag">-delete</span></td>
  424. <td colspan="1" rowspan="1">Delete the files existing in the dst but not in src</td>
  425. <td colspan="1" rowspan="1">The deletion is done by the FS Shell, so the trash will be used
  426. if it is enabled.
  427. </td>
  428. </tr>
  429. </table>
  430. <a name="N10171"></a><a name="Symbolic-Representations"></a>
  431. <h4>Symbolic Representations</h4>
  432. <p>
  433. The parameter &lt;n&gt; in <span class="codefrag">-filelimit</span>
  434. and <span class="codefrag">-sizelimit</span> can be specified with symbolic
  435. representation. For example:
  436. </p>
  437. <ul>
  438. <li>1230k = 1230 * 1024 = 1259520</li>
  439. <li>891g = 891 * 1024^3 = 956703965184</li>
  440. </ul>
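<p>The arithmetic in the examples above can be reproduced with a small shell function. This is an illustrative sketch only: <span class="codefrag">expand_symbolic</span> is a hypothetical name, and the <span class="codefrag">m</span> suffix is assumed by analogy with the <span class="codefrag">k</span> and <span class="codefrag">g</span> suffixes shown in this guide.</p>

```shell
# Expand a symbolic size such as 1230k or 891g into a plain number.
# Only k and g appear in the guide; m is assumed here by analogy.
expand_symbolic() {
  value="$1"
  case "$value" in
    *k) echo $(( ${value%k} * 1024 )) ;;
    *m) echo $(( ${value%m} * 1024 * 1024 )) ;;
    *g) echo $(( ${value%g} * 1024 * 1024 * 1024 )) ;;
    *)  echo "$value" ;;   # no suffix: the value is used as given
  esac
}

expand_symbolic 1230k   # prints 1259520
expand_symbolic 891g    # prints 956703965184
```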
  441. <a name="N1018A"></a><a name="uo"></a>
  442. <h4>Update and Overwrite</h4>
  443. <p>It's worth giving some examples of <span class="codefrag">-update</span> and
  444. <span class="codefrag">-overwrite</span>. Consider a copy from <span class="codefrag">/foo/a</span> and
  445. <span class="codefrag">/foo/b</span> to <span class="codefrag">/bar/foo</span>, where the sources contain
  446. the following:</p>
  447. <p>
  448. <span class="codefrag">&nbsp;&nbsp;&nbsp;&nbsp;hdfs://nn1:8020/foo/a</span>
  449. <br>
  450. <span class="codefrag">&nbsp;&nbsp;&nbsp;&nbsp;hdfs://nn1:8020/foo/a/aa</span>
  451. <br>
  452. <span class="codefrag">&nbsp;&nbsp;&nbsp;&nbsp;hdfs://nn1:8020/foo/a/ab</span>
  453. <br>
  454. <span class="codefrag">&nbsp;&nbsp;&nbsp;&nbsp;hdfs://nn1:8020/foo/b</span>
  455. <br>
  456. <span class="codefrag">&nbsp;&nbsp;&nbsp;&nbsp;hdfs://nn1:8020/foo/b/ba</span>
  457. <br>
  458. <span class="codefrag">&nbsp;&nbsp;&nbsp;&nbsp;hdfs://nn1:8020/foo/b/ab</span>
  459. </p>
  460. <p>If either <span class="codefrag">-update</span> or <span class="codefrag">-overwrite</span> is set,
  461. then both sources will map an entry to <span class="codefrag">/bar/foo/ab</span> at the
  462. destination. For both options, the contents of each source directory
  463. are compared with the <strong>contents</strong> of the destination
  464. directory. Rather than permit this conflict, DistCp will abort.</p>
  465. <p>In the default case, both <span class="codefrag">/bar/foo/a</span> and
  466. <span class="codefrag">/bar/foo/b</span> will be created and neither will collide.</p>
  467. <p>Now consider a legal copy using <span class="codefrag">-update</span>:<br>
  468. <span class="codefrag">distcp -update hdfs://nn1:8020/foo/a \</span>
  469. <br>
  470. <span class="codefrag">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  471. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  472. hdfs://nn1:8020/foo/b \</span>
  473. <br>
  474. <span class="codefrag">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  475. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  476. hdfs://nn2:8020/bar</span>
  477. </p>
  478. <p>With sources/sizes:</p>
  479. <p>
  480. <span class="codefrag">&nbsp;&nbsp;&nbsp;&nbsp;hdfs://nn1:8020/foo/a</span>
  481. <br>
  482. <span class="codefrag">&nbsp;&nbsp;&nbsp;&nbsp;hdfs://nn1:8020/foo/a/aa 32</span>
  483. <br>
  484. <span class="codefrag">&nbsp;&nbsp;&nbsp;&nbsp;hdfs://nn1:8020/foo/a/ab 32</span>
  485. <br>
  486. <span class="codefrag">&nbsp;&nbsp;&nbsp;&nbsp;hdfs://nn1:8020/foo/b</span>
  487. <br>
  488. <span class="codefrag">&nbsp;&nbsp;&nbsp;&nbsp;hdfs://nn1:8020/foo/b/ba 64</span>
  489. <br>
  490. <span class="codefrag">&nbsp;&nbsp;&nbsp;&nbsp;hdfs://nn1:8020/foo/b/bb 32</span>
  491. </p>
  492. <p>And destination/sizes:</p>
  493. <p>
  494. <span class="codefrag">&nbsp;&nbsp;&nbsp;&nbsp;hdfs://nn2:8020/bar</span>
  495. <br>
  496. <span class="codefrag">&nbsp;&nbsp;&nbsp;&nbsp;hdfs://nn2:8020/bar/aa 32</span>
  497. <br>
  498. <span class="codefrag">&nbsp;&nbsp;&nbsp;&nbsp;hdfs://nn2:8020/bar/ba 32</span>
  499. <br>
  500. <span class="codefrag">&nbsp;&nbsp;&nbsp;&nbsp;hdfs://nn2:8020/bar/bb 64</span>
  501. </p>
  502. <p>Will effect:</p>
  503. <p>
  504. <span class="codefrag">&nbsp;&nbsp;&nbsp;&nbsp;hdfs://nn2:8020/bar</span>
  505. <br>
  506. <span class="codefrag">&nbsp;&nbsp;&nbsp;&nbsp;hdfs://nn2:8020/bar/aa 32</span>
  507. <br>
  508. <span class="codefrag">&nbsp;&nbsp;&nbsp;&nbsp;hdfs://nn2:8020/bar/ab 32</span>
  509. <br>
  510. <span class="codefrag">&nbsp;&nbsp;&nbsp;&nbsp;hdfs://nn2:8020/bar/ba 64</span>
  511. <br>
  512. <span class="codefrag">&nbsp;&nbsp;&nbsp;&nbsp;hdfs://nn2:8020/bar/bb 32</span>
  513. </p>
  514. <p>Only <span class="codefrag">aa</span> is not overwritten on nn2. If
  515. <span class="codefrag">-overwrite</span> were specified, all elements would be
  516. overwritten.</p>
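<p>The per-file decision in the <span class="codefrag">-update</span> example above can be sketched as follows. This is a simplified model of the size-only rule described in this guide, not DistCp's actual code; <span class="codefrag">update_action</span> is a hypothetical name, and an empty second argument stands for a file that does not exist at the destination.</p>

```shell
# Model the -update rule for one file: copy when the destination file is
# missing or its size differs from the source; otherwise skip.
# Hypothetical helper for illustration only.
update_action() {
  src_size="$1"
  dst_size="$2"   # empty string means "not present at the destination"
  if [ -z "$dst_size" ] || [ "$src_size" -ne "$dst_size" ]; then
    echo copy
  else
    echo skip
  fi
}

update_action 32 32   # aa: sizes match, prints skip
update_action 32 ""   # ab: missing at destination, prints copy
update_action 64 32   # ba: sizes differ, prints copy
update_action 32 64   # bb: sizes differ, prints copy
```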
  517. </div> <!-- Usage -->
  518. <a name="N1023B"></a><a name="etc"></a>
  519. <h2 class="h3">Appendix</h2>
  520. <div class="section">
  521. <a name="N10241"></a><a name="Map+sizing"></a>
  522. <h3 class="h4">Map sizing</h3>
  523. <p>DistCp makes a faint attempt to size each map comparably so that
  524. each copies roughly the same number of bytes. Note that files are the
  525. finest level of granularity, so increasing the number of simultaneous
  526. copiers (i.e. maps) may not always increase the number of
  527. simultaneous copies nor the overall throughput.</p>
  528. <p>If <span class="codefrag">-m</span> is not specified, DistCp will attempt to
  529. schedule work for <span class="codefrag">min (total_bytes / bytes.per.map, 20 *
  530. num_task_trackers)</span> where <span class="codefrag">bytes.per.map</span> defaults
  531. to 256MB.</p>
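<p>As a worked illustration of the formula above, the following sketch computes the default number of maps. The function name <span class="codefrag">default_map_count</span> is hypothetical, integer division is assumed, and the lower bound of one map is an assumption added here so that very small copies still yield a usable value.</p>

```shell
# Estimate the default map count: min(total_bytes / bytes_per_map,
# 20 * num_task_trackers). Hypothetical helper, not DistCp's own code.
default_map_count() {
  total_bytes="$1"
  bytes_per_map="$2"
  num_task_trackers="$3"
  by_size=$(( total_bytes / bytes_per_map ))
  by_cluster=$(( 20 * num_task_trackers ))
  if [ "$by_size" -lt "$by_cluster" ]; then
    maps="$by_size"
  else
    maps="$by_cluster"
  fi
  # Assumption: never report fewer than one map.
  if [ "$maps" -lt 1 ]; then
    maps=1
  fi
  echo "$maps"
}

# 10 GB to copy, 256MB per map, 5 TaskTrackers: limited by size.
default_map_count 10737418240 268435456 5     # prints 40
# 1 TB to copy on the same cluster: limited by 20 * 5 trackers.
default_map_count 1099511627776 268435456 5   # prints 100
```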
  532. <p>Tuning the number of maps to the size of the source and
  533. destination clusters, the size of the copy, and the available
  534. bandwidth is recommended for long-running and regularly run jobs.</p>
  535. <a name="N1025A"></a><a name="cpver"></a>
  536. <h3 class="h4">Copying between versions of HDFS</h3>
  537. <p>For copying between two different versions of Hadoop, one will
  538. usually use HftpFileSystem. This is a read-only FileSystem, so DistCp
  539. must be run on the destination cluster (more specifically, on
  540. TaskTrackers that can write to the destination cluster). Each source is
  541. specified as <span class="codefrag">hftp://&lt;dfs.http.address&gt;/&lt;path&gt;</span>
  542. (the default <span class="codefrag">dfs.http.address</span> is
  543. &lt;namenode&gt;:50070).</p>
  544. <a name="N1026A"></a><a name="Map%2FReduce+and+other+side-effects"></a>
  545. <h3 class="h4">Map/Reduce and other side-effects</h3>
  546. <p>As mentioned above, should a map fail to copy
  547. one of its inputs, there will be several side-effects.</p>
  548. <ul>
  549. <li>Unless <span class="codefrag">-i</span> is specified, the logs generated by that
  550. task attempt will be replaced by those of the re-executed attempt.</li>
  551. <li>Unless <span class="codefrag">-overwrite</span> is specified, files successfully
  552. copied by a previous map on a re-execution will be marked as
  553. "skipped".</li>
  554. <li>If a map fails <span class="codefrag">mapred.map.max.attempts</span> times, the
  555. remaining map tasks will be killed (unless <span class="codefrag">-i</span> is
  556. set).</li>
  557. <li>If <span class="codefrag">mapred.speculative.execution</span> is set
  558. <span class="codefrag">final</span> and <span class="codefrag">true</span>, the result of the copy is
  559. undefined.</li>
  560. </ul>
  561. </div> <!-- Appendix -->
  562. </div>
  563. <!--+
  564. |end content
  565. +-->
  566. <div class="clearboth">&nbsp;</div>
  567. </div>
  568. <div id="footer">
  569. <!--+
  570. |start bottomstrip
  571. +-->
  572. <div class="lastmodified">
  573. <script type="text/javascript"><!--
  574. document.write("Last Published: " + document.lastModified);
  575. // --></script>
  576. </div>
  577. <div class="copyright">
  578. Copyright &copy;
  579. 2008 <a href="http://www.apache.org/licenses/">The Apache Software Foundation.</a>
  580. </div>
  581. <!--+
  582. |end bottomstrip
  583. +-->
  584. </div>
  585. </body>
  586. </html>