- <!--
- Copyright 2002-2004 The Apache Software Foundation
- Licensed under the Apache License, Version 2.0 (the "License");
- you may not use this file except in compliance with the License.
- You may obtain a copy of the License at
- http://www.apache.org/licenses/LICENSE-2.0
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
- -->
- <document xmlns="http://maven.apache.org/XDOC/2.0"
- xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
- xsi:schemaLocation="http://maven.apache.org/XDOC/2.0 http://maven.apache.org/xsd/xdoc-2.0.xsd">
- <head>
- <title>Usage</title>
- </head>
- <body>
- <section name="Basic Usage">
- <p>The most common invocation of DistCp is an inter-cluster copy:</p>
- <p><code>bash$ hadoop jar hadoop-distcp.jar hdfs://nn1:8020/foo/bar \</code><br/>
- <code> hdfs://nn2:8020/bar/foo</code></p>
- <p>This will expand the namespace under <code>/foo/bar</code> on nn1
- into a temporary file, partition its contents among a set of map
- tasks, and start a copy on each TaskTracker from nn1 to nn2.</p>
- <p>One can also specify multiple source directories on the command
- line:</p>
- <p><code>bash$ hadoop jar hadoop-distcp.jar hdfs://nn1:8020/foo/a \</code><br/>
- <code> hdfs://nn1:8020/foo/b \</code><br/>
- <code> hdfs://nn2:8020/bar/foo</code></p>
- <p>Or, equivalently, from a file using the <code>-f</code> option:<br/>
- <code>bash$ hadoop jar hadoop-distcp.jar -f hdfs://nn1:8020/srclist \</code><br/>
- <code> hdfs://nn2:8020/bar/foo</code><br/></p>
- <p>Where <code>srclist</code> contains<br/>
- <code>hdfs://nn1:8020/foo/a</code><br/>
- <code>hdfs://nn1:8020/foo/b</code></p>
- <p>When copying from multiple sources, DistCp will abort the copy with
- an error message if two sources collide, but collisions at the
- destination are resolved per the <a href="#options">options</a>
- specified. By default, files already existing at the destination are
- skipped (i.e. not replaced by the source file). A count of skipped
- files is reported at the end of each job, but it may be inaccurate if a
- copier failed for some subset of its files and succeeded only on a later
- attempt.</p>
- <p>It is important that each TaskTracker can reach and communicate with
- both the source and destination file systems. For HDFS, both the source
- and destination must be running the same version of the protocol or use
- a backwards-compatible protocol (see <a href="#cpver">Copying Between
- Versions</a>).</p>
- <p>After a copy, it is recommended to generate and cross-check
- a listing of the source and destination to verify that the copy was
- truly successful. DistCp relies on both Map/Reduce and the
- FileSystem API, so issues in or between DistCp, Map/Reduce, and the
- FileSystem API could adversely and silently affect the copy. Some have had success running with
- <code>-update</code> enabled to perform a second pass, but users should
- be acquainted with its semantics before attempting this.</p>
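- <p>The cross-check described above can be sketched as follows. This is a
- minimal illustration and not part of DistCp: it assumes the two listings
- have already been collected (for example, from a recursive listing of each
- side) into dictionaries mapping a relative path to a size in bytes, and the
- name <code>compare_listings</code> is hypothetical. Matching sizes do not
- prove that contents match, so a check like this only catches missing, extra,
- or wrongly sized files.</p>

```python
def compare_listings(source, target):
    """Compare two {relative_path: size_in_bytes} listings.

    Hypothetical helper: how the listings are produced is left to the caller.
    """
    # Paths present at the source but absent at the destination.
    missing = sorted(p for p in source if p not in target)
    # Paths present at the destination but absent at the source.
    extra = sorted(p for p in target if p not in source)
    # Paths present on both sides whose sizes disagree.
    size_mismatch = sorted(
        p for p in source if p in target and source[p] != target[p]
    )
    return {"missing": missing, "extra": extra, "size_mismatch": size_mismatch}


if __name__ == "__main__":
    # Made-up listings, used only to exercise the function.
    src = {"a/part-0": 128, "a/part-1": 64, "b/part-0": 32}
    dst = {"a/part-0": 128, "a/part-1": 60, "c/stale": 8}
    for kind, paths in compare_listings(src, dst).items():
        print(kind, paths)
```

- <p>An empty result in all three categories suggests, but does not prove,
- that the copy is complete.</p>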
- <p>It's also worth noting that if another client is still writing to a
- source file, the copy will likely fail. Attempting to overwrite a file
- being written at the destination should also fail on HDFS. If a source
- file is (re)moved before it is copied, the copy will fail with a
- FileNotFoundException.</p>
- <p>Please refer to the detailed Command Line Reference for information
- on all the options available in DistCp.</p>
-
- </section>
- <section name="Update and Overwrite">
- <p><code>-update</code> is used to copy files from the source that don't
- exist at the target, or whose contents differ from the target version.
- <code>-overwrite</code> overwrites files at the target even if they already
- exist there and have the same contents.</p>
- <p><br/>Update and Overwrite options warrant special attention, since their
- handling of source-paths varies from the defaults in a very subtle manner.
- Consider a copy from <code>/source/first/</code> and
- <code>/source/second/</code> to <code>/target/</code>, where the source
- paths have the following contents:</p>
- <p><code>hdfs://nn1:8020/source/first/1</code><br/>
- <code>hdfs://nn1:8020/source/first/2</code><br/>
- <code>hdfs://nn1:8020/source/second/10</code><br/>
- <code>hdfs://nn1:8020/source/second/20</code><br/></p>
- <p><br/>When DistCp is invoked without <code>-update</code> or
- <code>-overwrite</code>, the DistCp defaults would create directories
- <code>first/</code> and <code>second/</code>, under <code>/target</code>.
- Thus:<br/></p>
- <p><code>distcp hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target</code></p>
- <p><br/>would yield the following contents in <code>/target</code>: </p>
- <p><code>hdfs://nn2:8020/target/first/1</code><br/>
- <code>hdfs://nn2:8020/target/first/2</code><br/>
- <code>hdfs://nn2:8020/target/second/10</code><br/>
- <code>hdfs://nn2:8020/target/second/20</code><br/></p>
- <p><br/>When either <code>-update</code> or <code>-overwrite</code> is
- specified, the <strong>contents</strong> of the source-directories
- are copied to target, and not the source directories themselves. Thus: </p>
- <p><code>distcp -update hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target</code></p>
- <p><br/>would yield the following contents in <code>/target</code>: </p>
- <p><code>hdfs://nn2:8020/target/1</code><br/>
- <code>hdfs://nn2:8020/target/2</code><br/>
- <code>hdfs://nn2:8020/target/10</code><br/>
- <code>hdfs://nn2:8020/target/20</code><br/></p>
- <p><br/>By extension, if both source folders contained a file with the same
- name (say, <code>0</code>), then both sources would map an entry to
- <code>/target/0</code> at the destination. Rather than permit this
- conflict, DistCp will abort.</p>
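- <p>The difference in path handling can be illustrated with a small sketch.
- It is not DistCp's implementation; the function <code>map_targets</code> and
- its <code>contents_only</code> flag are hypothetical names that model the
- behaviour described above. By default each source directory is recreated
- under the target, while with <code>-update</code> or <code>-overwrite</code>
- only the directory contents are mapped, and two sources that map to the same
- target path are rejected.</p>

```python
import posixpath


def map_targets(sources, target_root, contents_only=False):
    """Map source files to target paths.

    sources: {source_dir: [file names directly under that dir]}
    contents_only: False models the default behaviour; True models the
    behaviour described for -update / -overwrite (an illustrative assumption).
    Returns {target_path: source_path}; raises ValueError on a collision.
    """
    mapping = {}
    for src_dir, names in sources.items():
        base = posixpath.basename(src_dir.rstrip("/"))
        for name in names:
            src = posixpath.join(src_dir, name)
            if contents_only:
                # Only the directory contents land in the target.
                dest = posixpath.join(target_root, name)
            else:
                # The source directory itself is recreated in the target.
                dest = posixpath.join(target_root, base, name)
            if dest in mapping:
                raise ValueError(
                    f"{mapping[dest]} and {src} both map to {dest}"
                )
            mapping[dest] = src
    return mapping


if __name__ == "__main__":
    sources = {"/source/first": ["1", "2"], "/source/second": ["10", "20"]}
    print(sorted(map_targets(sources, "/target")))
    print(sorted(map_targets(sources, "/target", contents_only=True)))
```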
- <p><br/>Now, consider the following copy operation with <code>-update</code>:</p>
- <p><code>distcp -update hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target</code></p>
- <p><br/>With sources/sizes:</p>
- <p><code>hdfs://nn1:8020/source/first/1 32</code><br/>
- <code>hdfs://nn1:8020/source/first/2 32</code><br/>
- <code>hdfs://nn1:8020/source/second/10 64</code><br/>
- <code>hdfs://nn1:8020/source/second/20 32</code><br/></p>
- <p><br/>And destination/sizes:</p>
- <p><code>hdfs://nn2:8020/target/1 32</code><br/>
- <code>hdfs://nn2:8020/target/10 32</code><br/>
- <code>hdfs://nn2:8020/target/20 64</code><br/></p>
- <p><br/>The copy results in the following contents in <code>/target</code>: </p>
- <p><code>hdfs://nn2:8020/target/1 32</code><br/>
- <code>hdfs://nn2:8020/target/2 32</code><br/>
- <code>hdfs://nn2:8020/target/10 64</code><br/>
- <code>hdfs://nn2:8020/target/20 32</code><br/></p>
- <p><br/>With <code>-update</code>, <code>1</code> is skipped because the file-length and contents match.
- <code>2</code> is copied because it doesn't exist at the target.
- <code>10</code> and <code>20</code> are overwritten since the contents
- don't match the source. </p>
- <p>If <code>-overwrite</code> is used, <code>1</code> is overwritten as well.</p>
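- <p>The per-file outcomes in this example can be sketched as a small decision
- function. This is an illustration under stated assumptions, not DistCp's
- code: <code>plan_copy</code> is a hypothetical name, and it compares only
- file lengths as a stand-in for the length-and-contents comparison described
- above. An <code>update</code> pass skips files that already match, while an
- <code>overwrite</code> pass re-copies every file that exists at the target.</p>

```python
def plan_copy(source_sizes, target_sizes, mode):
    """Decide what happens to each source file.

    source_sizes, target_sizes: {file_name: size}
    mode: "update" or "overwrite"
    Assumption: equal size stands in for "length and contents match".
    """
    if mode not in ("update", "overwrite"):
        raise ValueError("mode must be 'update' or 'overwrite'")
    plan = {}
    for name, size in source_sizes.items():
        if name not in target_sizes:
            plan[name] = "copy"        # nothing at the target yet
        elif mode == "overwrite":
            plan[name] = "overwrite"   # replaced even if identical
        elif target_sizes[name] != size:
            plan[name] = "overwrite"   # differs from the source
        else:
            plan[name] = "skip"        # already matches the source
    return plan


if __name__ == "__main__":
    # Sizes taken from the example above.
    source = {"1": 32, "2": 32, "10": 64, "20": 32}
    target = {"1": 32, "10": 32, "20": 64}
    print(plan_copy(source, target, "update"))
    print(plan_copy(source, target, "overwrite"))
```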
- </section>
- </body>
- </document>