123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118 |
- ~~ Licensed under the Apache License, Version 2.0 (the "License");
- ~~ you may not use this file except in compliance with the License.
- ~~ You may obtain a copy of the License at
- ~~
- ~~ http://www.apache.org/licenses/LICENSE-2.0
- ~~
- ~~ Unless required by applicable law or agreed to in writing, software
- ~~ distributed under the License is distributed on an "AS IS" BASIS,
- ~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- ~~ See the License for the specific language governing permissions and
- ~~ limitations under the License. See accompanying LICENSE file.
- ---
- Hadoop Map Reduce Next Generation-${project.version} - Distributed Cache Deploy
- ---
- ---
- ${maven.build.timestamp}
- Hadoop MapReduce Next Generation - Distributed Cache Deploy
- * Introduction
- The MapReduce application framework has rudimentary support for deploying a
- new version of the MapReduce framework via the distributed cache. By setting
- the appropriate configuration properties, users can run a different version
- of MapReduce than the one initially deployed to the cluster. For example,
- cluster administrators can place multiple versions of MapReduce in HDFS and
- configure <<<mapred-site.xml>>> to specify which version jobs will use by
- default. This allows the administrators to perform a rolling upgrade of the
- MapReduce framework under certain conditions.
- * Preconditions and Limitations
- The support for deploying the MapReduce framework via the distributed cache
- currently does not address the job client code used to submit and query
- jobs. It also does not address the <<<ShuffleHandler>>> code that runs as an
- auxilliary service within each NodeManager. As a result the following
- limitations apply to MapReduce versions that can be successfully deployed via
- the distributed cache in a rolling upgrade fashion:
- * The MapReduce version must be compatible with the job client code used to
- submit and query jobs. If it is incompatible then the job client must be
- upgraded separately on any node from which jobs using the new MapReduce
- version will be submitted or queried.
- * The MapReduce version must be compatible with the configuration files used
- by the job client submitting the jobs. If it is incompatible with that
- configuration (e.g.: a new property must be set or an existing property
- value changed) then the configuration must be updated first.
- * The MapReduce version must be compatible with the <<<ShuffleHandler>>>
- version running on the nodes in the cluster. If it is incompatible then the
- new <<<ShuffleHandler>>> code must be deployed to all the nodes in the
- cluster, and the NodeManagers must be restarted to pick up the new
- <<<ShuffleHandler>>> code.
- * Deploying a New MapReduce Version via the Distributed Cache
- Deploying a new MapReduce version consists of three steps:
- [[1]] Upload the MapReduce archive to a location that can be accessed by the
- job submission client. Ideally the archive should be on the cluster's default
- filesystem at a publicly-readable path. See the archive location discussion
- below for more details.
- [[2]] Configure <<<mapreduce.application.framework.path>>> to point to the
- location where the archive is located. As when specifying distributed cache
- files for a job, this is a URL that also supports creating an alias for the
- archive if a URL fragment is specified. For example,
- <<<hdfs:/mapred/framework/hadoop-mapreduce-2.1.1.tar.gz#mrframework>>> will
- be localized as <<<mrframework>>> rather than
- <<<hadoop-mapreduce-2.1.1.tar.gz>>>.
- [[3]] Configure <<<mapreduce.application.classpath>>> to set the proper
- classpath to use with the MapReduce archive configured above. NOTE: An error
- occurs if <<<mapreduce.application.framework.path>>> is configured but
- <<<mapreduce.application.classpath>>> does not reference the base name of the
- archive path or the alias if an alias was specified.
- ** Location of the MapReduce Archive and How It Affects Job Performance
- Note that the location of the MapReduce archive can be critical to job
- submission and job startup performance. If the archive is not located on the
- cluster's default filesystem then it will be copied to the job staging
- directory for each job and localized to each node where the job's tasks
- run. This will slow down job submission and task startup performance.
- If the archive is located on the default filesystem then the job client will
- not upload the archive to the job staging directory for each job
- submission. However if the archive path is not readable by all cluster users
- then the archive will be localized separately for each user on each node
- where tasks execute. This can cause unnecessary duplication in the
- distributed cache.
- When working with a large cluster it can be important to increase the
- replication factor of the archive to increase its availability. This will
- spread the load when the nodes in the cluster localize the archive for the
- first time.
- * MapReduce Archives and Classpath Configuration
- Setting a proper classpath for the MapReduce archive depends upon the
- composition of the archive and whether it has any additional dependencies.
- For example, the archive can contain not only the MapReduce jars but also the
- necessary YARN, HDFS, and Hadoop Common jars and all other dependencies. In
- that case, <<<mapreduce.application.classpath>>> would be configured to
- something like the following example, where the archive basename is
- hadoop-mapreduce-2.1.1.tar.gz and the archive is organized internally similar
- to the standard Hadoop distribution archive:
- <<<$HADOOP_CONF_DIR,$PWD/hadoop-mapreduce-2.1.1.tar.gz/hadoop-mapreduce-2.1.1/share/hadoop/mapreduce/*,$PWD/hadoop-mapreduce-2.1.1.tar.gz/hadoop-mapreduce-2.1.1/share/hadoop/mapreduce/lib/*,$PWD/hadoop-mapreduce-2.1.1.tar.gz/hadoop-mapreduce-2.1.1/share/hadoop/common/*,$PWD/hadoop-mapreduce-2.1.1.tar.gz/hadoop-mapreduce-2.1.1/share/hadoop/common/lib/*,$PWD/hadoop-mapreduce-2.1.1.tar.gz/hadoop-mapreduce-2.1.1/share/hadoop/yarn/*,$PWD/hadoop-mapreduce-2.1.1.tar.gz/hadoop-mapreduce-2.1.1/share/hadoop/yarn/lib/*,$PWD/hadoop-mapreduce-2.1.1.tar.gz/hadoop-mapreduce-2.1.1/share/hadoop/hdfs/*,$PWD/hadoop-mapreduce-2.1.1.tar.gz/hadoop-mapreduce-2.1.1/share/hadoop/hdfs/lib/*>>>
- Another possible approach is to have the archive consist of just the
- MapReduce jars and have the remaining dependencies picked up from the Hadoop
- distribution installed on the nodes. In that case, the above example would
- change to something like the following:
- <<<$HADOOP_CONF_DIR,$PWD/hadoop-mapreduce-2.1.1.tar.gz/hadoop-mapreduce-2.1.1/share/hadoop/mapreduce/*,$PWD/hadoop-mapreduce-2.1.1.tar.gz/hadoop-mapreduce-2.1.1/share/hadoop/mapreduce/lib/*,$HADOOP_COMMON_HOME/share/hadoop/common/*,$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*>>>
|