architecture.xml 9.6 KB

<?xml version="1.0" encoding="UTF-8"?>
<!--
  Copyright 2002-2004 The Apache Software Foundation
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at
  http://www.apache.org/licenses/LICENSE-2.0
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<document xmlns="http://maven.apache.org/XDOC/2.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/XDOC/2.0 http://maven.apache.org/xsd/xdoc-2.0.xsd">
  <head>
    <title>Architecture of DistCp</title>
  </head>
  <body>
    <section name="Architecture">
      <p>The components of the new DistCp may be classified into the following
        categories:</p>
      <ul>
        <li>DistCp Driver</li>
        <li>Copy-listing generator</li>
        <li>Input-formats and Map-Reduce components</li>
      </ul>
      <subsection name="DistCp Driver">
        <p>The DistCp Driver components are responsible for:</p>
        <ul>
          <li>Parsing the arguments passed to the DistCp command on the
            command-line, via:
            <ul>
              <li>OptionsParser, and</li>
              <li>DistCpOptionsSwitch</li>
            </ul>
          </li>
          <li>Assembling the command arguments into an appropriate
            DistCpOptions object, and initializing DistCp. These arguments
            include:
            <ul>
              <li>Source-paths</li>
              <li>Target location</li>
              <li>Copy options (e.g. whether to update-copy, overwrite, which
                file-attributes to preserve, etc.)</li>
            </ul>
          </li>
          <li>Orchestrating the copy operation by:
            <ul>
              <li>Invoking the copy-listing-generator to create the list of
                files to be copied.</li>
              <li>Setting up and launching the Hadoop Map-Reduce Job to carry
                out the copy.</li>
              <li>Based on the options, either returning a handle to the
                Hadoop MR Job immediately, or waiting until the Job
                completes.</li>
            </ul>
          </li>
        </ul>
        <br/>
        <p>The parser-elements are exercised only from the command-line (or if
          DistCp::run() is invoked). The DistCp class may also be used
          programmatically, by constructing the DistCpOptions object and
          initializing a DistCp object with it.</p>
      </subsection>
      <subsection name="Copy-listing generator">
        <p>The copy-listing-generator classes are responsible for creating the
          list of files/directories to be copied from source. They examine
          the contents of the source-paths (files/directories, including
          wild-cards), and record all paths that need to be copied in a
          sequence-file, for consumption by the DistCp Hadoop Job. The main
          classes in this module include:</p>
        <ol>
          <li>CopyListing: The interface that should be implemented by any
            copy-listing-generator implementation. Also provides the factory
            method by which the concrete CopyListing implementation is
            chosen.</li>
          <li>SimpleCopyListing: An implementation of CopyListing that accepts
            multiple source paths (files/directories), and recursively lists
            all the individual files and directories under each, for
            copy.</li>
          <li>GlobbedCopyListing: Another implementation of CopyListing that
            expands wild-cards in the source paths.</li>
          <li>FileBasedCopyListing: An implementation of CopyListing that
            reads the source-path list from a specified file.</li>
        </ol>
        <p/>
        <p>Based on whether a source-file-list is specified in the
          DistCpOptions, the source-listing is generated in one of the
          following ways:</p>
        <ol>
          <li>If there's no source-file-list, the GlobbedCopyListing is used.
            All wild-cards are expanded, and all the expansions are
            forwarded to the SimpleCopyListing, which in turn constructs the
            listing (via recursive descent of each path).</li>
          <li>If a source-file-list is specified, the FileBasedCopyListing is
            used. Source-paths are read from the specified file, and then
            forwarded to the GlobbedCopyListing. The listing is then
            constructed as described above.</li>
        </ol>
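        <p>The two cases above amount to a fixed delegation order between the
          listing classes. The following is a minimal sketch of that rule in
          plain Java, not the DistCp source: the class ListingChainSketch and
          the method delegationChain are hypothetical names introduced for
          illustration, and only the three listing class names are taken from
          the text.</p>
        <source><![CDATA[
```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the selection rule described above; not part of DistCp.
final class ListingChainSketch {

    /**
     * Returns the names of the listing classes in the order in which they
     * handle the source paths. A null argument stands for "no
     * source-file-list was specified in the options".
     */
    static List<String> delegationChain(String sourceFileList) {
        if (sourceFileList == null) {
            // Wild-cards are expanded first, then each expansion is listed recursively.
            return Arrays.asList("GlobbedCopyListing", "SimpleCopyListing");
        }
        // Paths are read from the file first, then follow the same route as above.
        return Arrays.asList(
                "FileBasedCopyListing", "GlobbedCopyListing", "SimpleCopyListing");
    }
}
```
]]></source>
        <p>In this model, specifying a source-file-list only adds one step at
          the front of the chain; wild-card expansion and recursive descent
          stay the same in both cases.</p>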
        <br/>
        <p>One may customize the method by which the copy-listing is
          constructed by providing a custom implementation of the CopyListing
          interface. The behaviour of DistCp differs here from the legacy
          DistCp in how paths are considered for copy.</p>
        <p>The legacy implementation only lists those paths that must
          definitely be copied onto the target.
          E.g. if a file already exists at the target (and -overwrite isn't
          specified), the file isn't even considered in the Map-Reduce Copy
          Job. Determining this during setup (i.e. before the Map-Reduce Job)
          involves file-size and checksum comparisons that are potentially
          time-consuming.</p>
        <p>The new DistCp postpones such checks until the Map-Reduce Job, thus
          reducing setup time. Performance is enhanced further since these
          checks are parallelized across multiple maps.</p>
      </subsection>
      <subsection name="Input-formats and Map-Reduce components">
        <p>The Input-formats and Map-Reduce components are responsible for
          the actual copy of files and directories from the source to the
          destination path. The listing-file created during copy-listing
          generation is consumed at this point, when the copy is carried
          out. The classes of interest here include:</p>
        <ul>
          <li><strong>UniformSizeInputFormat:</strong> This implementation of
            org.apache.hadoop.mapreduce.InputFormat provides equivalence
            with Legacy DistCp in balancing load across maps.
            The aim of the UniformSizeInputFormat is to make each map copy
            roughly the same number of bytes. Accordingly, the listing file is
            split into groups of paths, such that the sum of file-sizes in
            each InputSplit is nearly equal to that of every other split. The
            splitting isn't always perfect, but its trivial implementation
            keeps the setup-time low.</li>
          <li><strong>DynamicInputFormat and DynamicRecordReader:</strong>
            <p>The DynamicInputFormat implements org.apache.hadoop.mapreduce.InputFormat,
              and is new to DistCp. The listing-file is split into several
              "chunk-files", the exact number of chunk-files being a multiple
              of the number of maps requested for the Hadoop Job. Each map
              task is "assigned" one of the chunk-files (by renaming the chunk
              to the task's id), before the Job is launched.</p>
            <p>Paths are read from each chunk using the DynamicRecordReader,
              and processed in the CopyMapper. After all the paths in a chunk
              are processed, the current chunk is deleted and a new chunk is
              acquired. The process continues until no more chunks are
              available.</p>
            <p>This "dynamic" approach allows faster map-tasks to consume
              more paths than slower ones, thus speeding up the DistCp job
              overall.</p>
          </li>
          <li><strong>CopyMapper:</strong> This class implements the physical
            file-copy. The input-paths are checked against the input-options
            (specified in the Job's Configuration), to determine whether a
            file needs to be copied. A file will be copied only if at least
            one of the following is true:
            <ul>
              <li>A file with the same name doesn't exist at target.</li>
              <li>A file with the same name exists at target, but has a
                different file size.</li>
              <li>A file with the same name exists at target, but has a
                different checksum, and -skipcrccheck isn't specified.</li>
              <li>A file with the same name exists at target, but -overwrite
                is specified.</li>
              <li>A file with the same name exists at target, but differs in
                block-size (and block-size needs to be preserved).</li>
            </ul>
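            <p>The conditions listed above can be read as a single boolean
              predicate. The following is a minimal sketch in plain Java, not
              the CopyMapper source: the class CopyDecisionSketch, its nested
              FileInfo type, and the method needsCopy are hypothetical, and
              checksums are modelled as plain strings. A null target stands
              for "no file with the same name exists at target".</p>
            <source><![CDATA[
```java
// Hypothetical, simplified model of the copy decision; not the CopyMapper source.
final class CopyDecisionSketch {

    /** Minimal description of a file: size, block-size, and a checksum string. */
    static final class FileInfo {
        final long size;
        final long blockSize;
        final String checksum;

        FileInfo(long size, long blockSize, String checksum) {
            this.size = size;
            this.blockSize = blockSize;
            this.checksum = checksum;
        }
    }

    /** Returns true if at least one of the listed conditions holds. */
    static boolean needsCopy(FileInfo source, FileInfo target,
                             boolean overwrite, boolean skipCrcCheck,
                             boolean preserveBlockSize) {
        if (target == null) {
            return true;                      // nothing with that name at target
        }
        if (overwrite) {
            return true;                      // -overwrite is specified
        }
        if (source.size != target.size) {
            return true;                      // file sizes differ
        }
        if (preserveBlockSize && source.blockSize != target.blockSize) {
            return true;                      // block-size must be preserved and differs
        }
        // Checksums are compared only when -skipcrccheck isn't specified.
        return !skipCrcCheck && !source.checksum.equals(target.checksum);
    }
}
```
]]></source>
            <p>The order of the checks in this sketch is an assumption, chosen
              so that the cheaper comparisons run before the checksum
              comparison; since the conditions are combined with a logical OR,
              the order does not change the result.</p>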
          </li>
          <li><strong>CopyCommitter:</strong>
            This class is responsible for the commit-phase of the DistCp
            job, including:
            <ul>
              <li>Preservation of directory-permissions (if specified in the
                options)</li>
              <li>Clean-up of temporary-files, work-directories, etc.</li>
            </ul>
          </li>
        </ul>
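        <p>To illustrate the byte-balancing idea behind the
          UniformSizeInputFormat described above, the following is a minimal
          sketch of one simple greedy grouping in plain Java. It is not the
          Hadoop implementation: the class UniformSplitSketch and the method
          split are hypothetical, files are represented only by their sizes,
          and the rule for closing a group is an assumption.</p>
        <source><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of greedy, size-balanced grouping; not the Hadoop source.
final class UniformSplitSketch {

    /**
     * Walks the file sizes in listing order and groups them so that each
     * group holds roughly total/numSplits bytes.
     */
    static List<List<Long>> split(long[] fileSizes, int numSplits) {
        long total = 0;
        for (long size : fileSizes) {
            total += size;
        }
        // Target bytes per group, rounded up.
        long target = (total + numSplits - 1) / numSplits;

        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current group if this file would push it past the target.
            if (!current.isEmpty() && currentBytes + size > target) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }
}
```
]]></source>
        <p>A single pass over the listing keeps setup cheap, at the cost of
          imperfect balance: one very large file fills a group by itself, and
          in this sketch the number of groups can differ from the number
          requested.</p>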
      </subsection>
    </section>
  </body>
</document>