- <?xml version="1.0" encoding="UTF-8"?>
- <!--
- Copyright 2002-2004 The Apache Software Foundation
- Licensed under the Apache License, Version 2.0 (the "License");
- you may not use this file except in compliance with the License.
- You may obtain a copy of the License at
- http://www.apache.org/licenses/LICENSE-2.0
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
- -->
- <document xmlns="http://maven.apache.org/XDOC/2.0"
- xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
- xsi:schemaLocation="http://maven.apache.org/XDOC/2.0 http://maven.apache.org/xsd/xdoc-2.0.xsd">
- <head>
- <title>Architecture of DistCp</title>
- </head>
- <body>
- <section name="Architecture">
- <p>The components of the new DistCp may be classified into the following
- categories: </p>
- <ul>
- <li>DistCp Driver</li>
- <li>Copy-listing generator</li>
- <li>Input-formats and Map-Reduce components</li>
- </ul>
- <subsection name="DistCp Driver">
- <p>The DistCp Driver components are responsible for:</p>
- <ul>
- <li>Parsing the arguments passed to the DistCp command on the
- command-line, via:
- <ul>
- <li>OptionsParser, and</li>
- <li>DistCpOptionsSwitch</li>
- </ul>
- </li>
- <li>Assembling the command arguments into an appropriate
- DistCpOptions object, and initializing DistCp. These arguments
- include:
- <ul>
- <li>Source-paths</li>
- <li>Target location</li>
- <li>Copy options (e.g. whether to update-copy, overwrite, which
- file-attributes to preserve, etc.)</li>
- </ul>
- </li>
- <li>Orchestrating the copy operation by:
- <ul>
- <li>Invoking the copy-listing-generator to create the list of
- files to be copied.</li>
- <li>Setting up and launching the Hadoop Map-Reduce Job to carry
- out the copy.</li>
- <li>Based on the options, either returning a handle to the
- Hadoop MR Job immediately, or waiting till completion.</li>
- </ul>
- </li>
- </ul>
- <br/>
- <p>The parser-elements are exercised only from the command-line (or if
- DistCp::run() is invoked). The DistCp class may also be used
- programmatically, by constructing the DistCpOptions object, and
- initializing a DistCp object appropriately.</p>
- </subsection>
- <subsection name="Copy-listing generator">
- <p>The copy-listing-generator classes are responsible for creating the
- list of files/directories to be copied from source. They examine
- the contents of the source-paths (files/directories, including
- wild-cards), and record all paths that need to be copied in a
- sequence-file, for consumption by the DistCp Hadoop Job. The main classes in
- this module include:</p>
- <ol>
- <li>CopyListing: The interface that should be implemented by any
- copy-listing-generator implementation. Also provides the factory
- method by which the concrete CopyListing implementation is
- chosen.</li>
- <li>SimpleCopyListing: An implementation of CopyListing that accepts
- multiple source paths (files/directories), and recursively lists
- all the individual files and directories under each, for
- copy.</li>
- <li>GlobbedCopyListing: Another implementation of CopyListing that
- expands wild-cards in the source paths.</li>
- <li>FileBasedCopyListing: An implementation of CopyListing that
- reads the source-path list from a specified file.</li>
- </ol>
- <p/>
- <p>Based on whether a source-file-list is specified in the
- DistCpOptions, the source-listing is generated in one of the
- following ways:</p>
- <ol>
- <li>If there's no source-file-list, the GlobbedCopyListing is used.
- All wild-cards are expanded, and all the expansions are
- forwarded to the SimpleCopyListing, which in turn constructs the
- listing (via recursive descent of each path). </li>
- <li>If a source-file-list is specified, the FileBasedCopyListing is
- used. Source-paths are read from the specified file, and then
- forwarded to the GlobbedCopyListing. The listing is then
- constructed as described above.</li>
- </ol>
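- <p>The delegation chain described above can be sketched as follows. This is
- a minimal, self-contained model and not the Hadoop API: the class
- CopyListingSketch, its method names, the in-memory directory map, and the
- trailing "/*" wild-card rule are all assumptions made for illustration.</p>
- <source><![CDATA[
```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model; these are not the Hadoop CopyListing classes.
final class CopyListingSketch {
    // Pretend file system: directory path -> immediate children.
    private final Map<String, List<String>> tree;

    CopyListingSketch(Map<String, List<String>> tree) {
        this.tree = tree;
    }

    private List<String> children(String dir) {
        return tree.getOrDefault(dir, Collections.<String>emptyList());
    }

    // Stand-in for SimpleCopyListing: recursive descent of each source path.
    List<String> simpleListing(List<String> sources) {
        List<String> out = new ArrayList<>();
        for (String source : sources) {
            descend(source, out);
        }
        return out;
    }

    private void descend(String path, List<String> out) {
        out.add(path);
        for (String child : children(path)) {
            descend(child, out);
        }
    }

    // Stand-in for GlobbedCopyListing: expand a trailing "/*", then delegate.
    List<String> globbedListing(List<String> patterns) {
        List<String> expanded = new ArrayList<>();
        for (String pattern : patterns) {
            if (pattern.endsWith("/*")) {
                expanded.addAll(children(pattern.substring(0, pattern.length() - 2)));
            } else {
                expanded.add(pattern);
            }
        }
        return simpleListing(expanded);
    }

    // Stand-in for FileBasedCopyListing: the "file" is modelled as its lines.
    List<String> fileBasedListing(List<String> sourceFileLines) {
        return globbedListing(sourceFileLines);
    }

    // Selection rule from the text: use the source-file-list when one is given.
    List<String> buildListing(List<String> sourcePaths, List<String> sourceFileLines) {
        return sourceFileLines != null
                ? fileBasedListing(sourceFileLines)
                : globbedListing(sourcePaths);
    }
}
```
- ]]></source>
- <p>In this sketch, passing null for the source-file-list selects the
- globbed route, while a non-null list is read first and then forwarded
- through the same globbing and recursive-descent steps.</p>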
- <br/>
- <p>One may customize the method by which the copy-listing is
- constructed by providing a custom implementation of the CopyListing
- interface. The behaviour of DistCp differs here from the legacy
- DistCp, in how paths are considered for copy. </p>
- <p>The legacy implementation only lists those paths that must
- definitely be copied on to target.
- E.g. if a file already exists at the target (and -overwrite isn't
- specified), the file isn't even considered in the Map-Reduce Copy
- Job. Determining this during setup (i.e. before the Map-Reduce Job)
- involves file-size and checksum-comparisons that are potentially
- time-consuming.</p>
- <p>The new DistCp postpones such checks until the Map-Reduce Job, thus
- reducing setup time. Performance is enhanced further since these
- checks are parallelized across multiple maps.</p>
- </subsection>
- <subsection name="Input-formats and Map-Reduce components">
- <p> The Input-formats and Map-Reduce components are responsible for
- the actual copy of files and directories from the source to the
- destination path. The listing-file created during copy-listing
- generation is consumed at this point, when the copy is carried
- out. The classes of interest here include:</p>
- <ul>
- <li><strong>UniformSizeInputFormat:</strong> This implementation of
- org.apache.hadoop.mapreduce.InputFormat provides equivalence
- with Legacy DistCp in balancing load across maps.
- The aim of the UniformSizeInputFormat is to make each map copy
- roughly the same number of bytes. To that end, the listing file is
- split into groups of paths, such that the sum of file-sizes in
- each InputSplit is nearly equal to that of every other InputSplit. The splitting
- isn't always perfect, but its trivial implementation keeps the
- setup-time low.</li>
- <li><strong>DynamicInputFormat and DynamicRecordReader:</strong>
- <p> The DynamicInputFormat implements org.apache.hadoop.mapreduce.InputFormat,
- and is new to DistCp. The listing-file is split into several
- "chunk-files", the exact number of chunk-files being a multiple
- of the number of maps requested in the Hadoop Job. Each map
- task is "assigned" one of the chunk-files (by renaming the chunk
- to the task's id), before the Job is launched.</p>
- <p>Paths are read from each chunk using the DynamicRecordReader,
- and processed in the CopyMapper. After all the paths in a chunk
- are processed, the current chunk is deleted and a new chunk is
- acquired. The process continues until no more chunks are
- available.</p>
- <p>This "dynamic" approach allows faster map-tasks to consume
- more paths than slower ones, thus speeding up the DistCp job
- overall. </p>
- </li>
- <li><strong>CopyMapper:</strong> This class implements the physical
- file-copy. The input-paths are checked against the input-options
- (specified in the Job's Configuration), to determine whether a
- file needs to be copied. A file will be copied only if at least one of
- the following is true:
- <ul>
- <li>A file with the same name doesn't exist at target.</li>
- <li>A file with the same name exists at target, but has a
- different file size.</li>
- <li>A file with the same name exists at target, but has a
- different checksum, and -skipcrccheck isn't specified.</li>
- <li>A file with the same name exists at target, but -overwrite
- is specified.</li>
- <li>A file with the same name exists at target, but differs in
- block-size (and block-size needs to be preserved).</li>
- </ul>
- </li>
- <li><strong>CopyCommitter:</strong>
- This class is responsible for the commit-phase of the DistCp
- job, including:
- <ul>
- <li>Preservation of directory-permissions (if specified in the
- options)</li>
- <li>Clean-up of temporary-files, work-directories, etc.</li>
- </ul>
- </li>
- </ul>
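- <p>The byte-balancing idea behind the UniformSizeInputFormat can be
- illustrated with a simple greedy grouping. This is a sketch under stated
- assumptions rather than the actual Hadoop implementation: the class
- UniformSplitSketch and its split method are hypothetical, and files are
- represented only by their sizes in listing order.</p>
- <source><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of size-balanced splitting; not the Hadoop class.
final class UniformSplitSketch {
    // Groups consecutive file sizes so each group holds roughly total/numMaps bytes.
    static List<List<Long>> split(List<Long> fileSizes, int numMaps) {
        long total = 0;
        for (long size : fileSizes) {
            total += size;
        }
        long bytesPerSplit = (total + numMaps - 1) / numMaps; // ceiling division

        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current group if adding this file would exceed the target.
            if (!current.isEmpty() && currentBytes + size > bytesPerSplit) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }
}
```
- ]]></source>
- <p>For file sizes 40, 60, 30, 70 and 100 with three maps, this sketch
- produces the groups (40, 60), (30, 70) and (100). Because files are never
- divided, the groups are not always equal, and the number of groups may
- differ from the requested number of maps.</p>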
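- <p>The chunk-acquisition loop of the DynamicInputFormat can be modelled
- with a shared queue. This is only an illustrative sketch: the class
- DynamicChunkSketch, the chunksPerMap parameter, and the in-memory queue
- are assumptions, whereas the real implementation works with chunk-files
- that are renamed and deleted.</p>
- <source><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical in-memory model of dynamic chunking; not the Hadoop classes.
final class DynamicChunkSketch {
    // Cuts the listing into numMaps * chunksPerMap chunks (some may be empty).
    static Queue<List<String>> makeChunks(List<String> listing, int numMaps, int chunksPerMap) {
        int numChunks = numMaps * chunksPerMap;
        List<List<String>> chunks = new ArrayList<>();
        for (int i = 0; i < numChunks; i++) {
            chunks.add(new ArrayList<String>());
        }
        for (int i = 0; i < listing.size(); i++) {
            chunks.get(i % numChunks).add(listing.get(i));
        }
        return new ConcurrentLinkedQueue<>(chunks);
    }

    // One map task: finish the current chunk, then try to acquire another.
    static List<String> runTask(Queue<List<String>> pool) {
        List<String> copied = new ArrayList<>();
        List<String> chunk;
        // poll() removes the chunk from the pool, so no other task can take it.
        while ((chunk = pool.poll()) != null) {
            copied.addAll(chunk); // stand-in for CopyMapper processing each path
        }
        return copied;
    }
}
```
- ]]></source>
- <p>When several tasks call runTask on the same pool concurrently, a task
- that finishes its chunks sooner simply returns for more, which is the
- load-balancing effect described for the "dynamic" approach.</p>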
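- <p>The copy conditions listed for the CopyMapper amount to a single
- predicate. The following is a hedged sketch of that decision only: the
- class CopyDecisionSketch, the FileInfo holder, and the boolean parameters
- are hypothetical names, and checksums are represented as plain strings.</p>
- <source><![CDATA[
```java
// Hypothetical model of the copy decision; not the Hadoop CopyMapper.
final class CopyDecisionSketch {
    // Minimal summary of a file's attributes, invented for this example.
    static final class FileInfo {
        final long length;
        final long blockSize;
        final String checksum;

        FileInfo(long length, long blockSize, String checksum) {
            this.length = length;
            this.blockSize = blockSize;
            this.checksum = checksum;
        }
    }

    // Returns true if at least one of the listed conditions holds.
    // A null target means that no file with the same name exists at target.
    static boolean mustCopy(FileInfo source, FileInfo target,
                            boolean overwrite, boolean skipCrcCheck,
                            boolean preserveBlockSize) {
        if (target == null) {
            return true; // nothing at target yet
        }
        if (overwrite) {
            return true; // -overwrite forces the copy
        }
        if (source.length != target.length) {
            return true; // different file size
        }
        if (preserveBlockSize && source.blockSize != target.blockSize) {
            return true; // block-size must be preserved and differs
        }
        if (!skipCrcCheck && !source.checksum.equals(target.checksum)) {
            return true; // checksum differs and -skipcrccheck is absent
        }
        return false;
    }
}
```
- ]]></source>
- <p>The order of the checks in this sketch puts the cheap comparisons
- before the checksum comparison; that ordering is a design choice of the
- example, not a statement about the actual implementation.</p>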
- </subsection>
- </section>
- </body>
- </document>