~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.

  ---
  Hadoop MapReduce Next Generation-${project.version} - Distributed Cache Deploy
  ---
  ---
  ${maven.build.timestamp}

Hadoop MapReduce Next Generation - Distributed Cache Deploy

* Introduction

  The MapReduce application framework has rudimentary support for deploying a
  new version of the MapReduce framework via the distributed cache. By setting
  the appropriate configuration properties, users can run a different version
  of MapReduce than the one initially deployed to the cluster. For example,
  cluster administrators can place multiple versions of MapReduce in HDFS and
  configure <<<mapred-site.xml>>> to specify which version jobs will use by
  default. This allows the administrators to perform a rolling upgrade of the
  MapReduce framework under certain conditions.

* Preconditions and Limitations

  The support for deploying the MapReduce framework via the distributed cache
  currently does not address the job client code used to submit and query
  jobs. It also does not address the <<<ShuffleHandler>>> code that runs as an
  auxiliary service within each NodeManager. As a result the following
  limitations apply to MapReduce versions that can be successfully deployed via
  the distributed cache in a rolling upgrade fashion:

  * The MapReduce version must be compatible with the job client code used to
    submit and query jobs. If it is incompatible then the job client must be
    upgraded separately on any node from which jobs using the new MapReduce
    version will be submitted or queried.

  * The MapReduce version must be compatible with the configuration files used
    by the job client submitting the jobs. If it is incompatible with that
    configuration (e.g.: a new property must be set or an existing property
    value changed) then the configuration must be updated first.

  * The MapReduce version must be compatible with the <<<ShuffleHandler>>>
    version running on the nodes in the cluster. If it is incompatible then the
    new <<<ShuffleHandler>>> code must be deployed to all the nodes in the
    cluster, and the NodeManagers must be restarted to pick up the new
    <<<ShuffleHandler>>> code.

* Deploying a New MapReduce Version via the Distributed Cache

  Deploying a new MapReduce version consists of three steps:

  [[1]] Upload the MapReduce archive to a location that can be accessed by the
  job submission client. Ideally the archive should be on the cluster's default
  filesystem at a publicly-readable path. See the archive location discussion
  below for more details.

  [[2]] Configure <<<mapreduce.application.framework.path>>> to point to the
  location where the archive is located. As when specifying distributed cache
  files for a job, this is a URL that also supports creating an alias for the
  archive if a URL fragment is specified. For example,
  <<<hdfs:/mapred/framework/hadoop-mapreduce-2.1.1.tar.gz#mrframework>>> will
  be localized as <<<mrframework>>> rather than
  <<<hadoop-mapreduce-2.1.1.tar.gz>>>.

  [[3]] Configure <<<mapreduce.application.classpath>>> to set the proper
  classpath to use with the MapReduce archive configured above. NOTE: An error
  occurs if <<<mapreduce.application.framework.path>>> is configured but
  <<<mapreduce.application.classpath>>> does not reference the base name of the
  archive path or the alias if an alias was specified. A <<<mapred-site.xml>>>
  sketch covering steps 2 and 3 follows this list.
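
  The following <<<mapred-site.xml>>> excerpt is a minimal sketch of steps 2
  and 3. It assumes the archive from the example above has been uploaded (e.g.
  with <<<hadoop fs -put>>>) to <<</mapred/framework>>> on the default
  filesystem, is aliased as <<<mrframework>>>, and contains just the MapReduce
  jars laid out as in the standard distribution (the second approach described
  in the classpath section below); adjust the paths to match the actual
  archive:

+---+
<!-- Illustrative sketch only: the archive name, alias, and internal
     layout below are assumptions taken from this document's examples. -->
<property>
  <name>mapreduce.application.framework.path</name>
  <value>hdfs:/mapred/framework/hadoop-mapreduce-2.1.1.tar.gz#mrframework</value>
</property>
<property>
  <name>mapreduce.application.classpath</name>
  <value>$HADOOP_CONF_DIR,$PWD/mrframework/hadoop-mapreduce-2.1.1/share/hadoop/mapreduce/*,$PWD/mrframework/hadoop-mapreduce-2.1.1/share/hadoop/mapreduce/lib/*,$HADOOP_COMMON_HOME/share/hadoop/common/*,$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*</value>
</property>
+---+

  Note that the classpath references the <<<mrframework>>> alias rather than
  the archive file name, satisfying the requirement in step 3.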

** Location of the MapReduce Archive and How It Affects Job Performance

  Note that the location of the MapReduce archive can be critical to job
  submission and job startup performance. If the archive is not located on the
  cluster's default filesystem then it will be copied to the job staging
  directory for each job and localized to each node where the job's tasks
  run. This will slow down job submission and task startup performance.

  If the archive is located on the default filesystem then the job client will
  not upload the archive to the job staging directory for each job
  submission. However if the archive path is not readable by all cluster users
  then the archive will be localized separately for each user on each node
  where tasks execute. This can cause unnecessary duplication in the
  distributed cache.

  When working with a large cluster it can be important to increase the
  replication factor of the archive to increase its availability. This will
  spread the load when the nodes in the cluster localize the archive for the
  first time.
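
  For example, with the archive path used in this document's examples, the
  archive can be made world-readable with
  <<<hadoop fs -chmod 644 /mapred/framework/hadoop-mapreduce-2.1.1.tar.gz>>>
  and its replication factor raised with
  <<<hadoop fs -setrep 10 /mapred/framework/hadoop-mapreduce-2.1.1.tar.gz>>>
  (the value 10 here is only illustrative; choose a factor appropriate for the
  cluster size).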

* MapReduce Archives and Classpath Configuration

  Setting a proper classpath for the MapReduce archive depends upon the
  composition of the archive and whether it has any additional dependencies.

  For example, the archive can contain not only the MapReduce jars but also the
  necessary YARN, HDFS, and Hadoop Common jars and all other dependencies. In
  that case, <<<mapreduce.application.classpath>>> would be configured to
  something like the following example, where the archive basename is
  hadoop-mapreduce-2.1.1.tar.gz and the archive is organized internally in a
  manner similar to the standard Hadoop distribution archive:

  <<<$HADOOP_CONF_DIR,$PWD/hadoop-mapreduce-2.1.1.tar.gz/hadoop-mapreduce-2.1.1/share/hadoop/mapreduce/*,$PWD/hadoop-mapreduce-2.1.1.tar.gz/hadoop-mapreduce-2.1.1/share/hadoop/mapreduce/lib/*,$PWD/hadoop-mapreduce-2.1.1.tar.gz/hadoop-mapreduce-2.1.1/share/hadoop/common/*,$PWD/hadoop-mapreduce-2.1.1.tar.gz/hadoop-mapreduce-2.1.1/share/hadoop/common/lib/*,$PWD/hadoop-mapreduce-2.1.1.tar.gz/hadoop-mapreduce-2.1.1/share/hadoop/yarn/*,$PWD/hadoop-mapreduce-2.1.1.tar.gz/hadoop-mapreduce-2.1.1/share/hadoop/yarn/lib/*,$PWD/hadoop-mapreduce-2.1.1.tar.gz/hadoop-mapreduce-2.1.1/share/hadoop/hdfs/*,$PWD/hadoop-mapreduce-2.1.1.tar.gz/hadoop-mapreduce-2.1.1/share/hadoop/hdfs/lib/*>>>

  Another possible approach is to have the archive consist of just the
  MapReduce jars and have the remaining dependencies picked up from the Hadoop
  distribution installed on the nodes. In that case, the above example would
  change to something like the following:

  <<<$HADOOP_CONF_DIR,$PWD/hadoop-mapreduce-2.1.1.tar.gz/hadoop-mapreduce-2.1.1/share/hadoop/mapreduce/*,$PWD/hadoop-mapreduce-2.1.1.tar.gz/hadoop-mapreduce-2.1.1/share/hadoop/mapreduce/lib/*,$HADOOP_COMMON_HOME/share/hadoop/common/*,$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*>>>
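
  Once these properties are set, a simple way to sanity-check the deployment
  is to run one of the example jobs shipped with the distribution, for
  instance <<<hadoop jar hadoop-mapreduce-examples-2.1.1.jar pi 2 10>>> (the
  jar name here assumes a 2.1.1 distribution), and verify that it completes
  successfully.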