|
@@ -0,0 +1,120 @@
|
|
|
+~~ Licensed under the Apache License, Version 2.0 (the "License");
|
|
|
+~~ you may not use this file except in compliance with the License.
|
|
|
+~~ You may obtain a copy of the License at
|
|
|
+~~
|
|
|
+~~ http://www.apache.org/licenses/LICENSE-2.0
|
|
|
+~~
|
|
|
+~~ Unless required by applicable law or agreed to in writing, software
|
|
|
+~~ distributed under the License is distributed on an "AS IS" BASIS,
|
|
|
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
|
+~~ See the License for the specific language governing permissions and
|
|
|
+~~ limitations under the License. See accompanying LICENSE file.
|
|
|
+
|
|
|
+ ---
|
|
|
+ Hadoop Map Reduce Next Generation-${project.version} - Distributed Cache Deploy
|
|
|
+ ---
|
|
|
+ ---
|
|
|
+ ${maven.build.timestamp}
|
|
|
+
|
|
|
+Hadoop MapReduce Next Generation - Distributed Cache Deploy
|
|
|
+
|
|
|
+ \[ {{{./index.html}Go Back}} \]
|
|
|
+
|
|
|
+* Introduction
|
|
|
+
|
|
|
+ The MapReduce application framework has rudimentary support for deploying a
|
|
|
+ new version of the MapReduce framework via the distributed cache. By setting
|
|
|
+ the appropriate configuration properties, users can run a different version
|
|
|
+ of MapReduce than the one initially deployed to the cluster. For example,
|
|
|
+ cluster administrators can place multiple versions of MapReduce in HDFS and
|
|
|
+ configure <<<mapred-site.xml>>> to specify which version jobs will use by
|
|
|
+ default. This allows the administrators to perform a rolling upgrade of the
|
|
|
+ MapReduce framework under certain conditions.
|
|
|
+
|
|
|
+* Preconditions and Limitations
|
|
|
+
|
|
|
+ The support for deploying the MapReduce framework via the distributed cache
|
|
|
+ currently does not address the job client code used to submit and query
|
|
|
+ jobs. It also does not address the <<<ShuffleHandler>>> code that runs as an
|
|
|
+ auxilliary service within each NodeManager. As a result the following
|
|
|
+ limitations apply to MapReduce versions that can be successfully deployed via
|
|
|
+ the distributed cache in a rolling upgrade fashion:
|
|
|
+
|
|
|
+ * The MapReduce version must be compatible with the job client code used to
|
|
|
+ submit and query jobs. If it is incompatible then the job client must be
|
|
|
+ upgraded separately on any node from which jobs using the new MapReduce
|
|
|
+ version will be submitted or queried.
|
|
|
+
|
|
|
+ * The MapReduce version must be compatible with the configuration files used
|
|
|
+ by the job client submitting the jobs. If it is incompatible with that
|
|
|
+ configuration (e.g.: a new property must be set or an existing property
|
|
|
+ value changed) then the configuration must be updated first.
|
|
|
+
|
|
|
+ * The MapReduce version must be compatible with the <<<ShuffleHandler>>>
|
|
|
+ version running on the nodes in the cluster. If it is incompatible then the
|
|
|
+ new <<<ShuffleHandler>>> code must be deployed to all the nodes in the
|
|
|
+ cluster, and the NodeManagers must be restarted to pick up the new
|
|
|
+ <<<ShuffleHandler>>> code.
|
|
|
+
|
|
|
+* Deploying a New MapReduce Version via the Distributed Cache
|
|
|
+
|
|
|
+ Deploying a new MapReduce version consists of three steps:
|
|
|
+
|
|
|
+ [[1]] Upload the MapReduce archive to a location that can be accessed by the
|
|
|
+ job submission client. Ideally the archive should be on the cluster's default
|
|
|
+ filesystem at a publicly-readable path. See the archive location discussion
|
|
|
+ below for more details.
|
|
|
+
|
|
|
+ [[2]] Configure <<<mapreduce.application.framework.path>>> to point to the
|
|
|
+ location where the archive is located. As when specifying distributed cache
|
|
|
+ files for a job, this is a URL that also supports creating an alias for the
|
|
|
+ archive if a URL fragment is specified. For example,
|
|
|
+ <<<hdfs:/mapred/framework/hadoop-mapreduce-2.1.1.tar.gz#mrframework>>> will
|
|
|
+ be localized as <<<mrframework>>> rather than
|
|
|
+ <<<hadoop-mapreduce-2.1.1.tar.gz>>>.
|
|
|
+
|
|
|
+ [[3]] Configure <<<mapreduce.application.classpath>>> to set the proper
|
|
|
+ classpath to use with the MapReduce archive configured above. NOTE: An error
|
|
|
+ occurs if <<<mapreduce.application.framework.path>>> is configured but
|
|
|
+ <<<mapreduce.application.classpath>>> does not reference the base name of the
|
|
|
+ archive path or the alias if an alias was specified.
|
|
|
+
|
|
|
+** Location of the MapReduce Archive and How It Affects Job Performance
|
|
|
+
|
|
|
+ Note that the location of the MapReduce archive can be critical to job
|
|
|
+ submission and job startup performance. If the archive is not located on the
|
|
|
+ cluster's default filesystem then it will be copied to the job staging
|
|
|
+ directory for each job and localized to each node where the job's tasks
|
|
|
+ run. This will slow down job submission and task startup performance.
|
|
|
+
|
|
|
+ If the archive is located on the default filesystem then the job client will
|
|
|
+ not upload the archive to the job staging directory for each job
|
|
|
+ submission. However if the archive path is not readable by all cluster users
|
|
|
+ then the archive will be localized separately for each user on each node
|
|
|
+ where tasks execute. This can cause unnecessary duplication in the
|
|
|
+ distributed cache.
|
|
|
+
|
|
|
+ When working with a large cluster it can be important to increase the
|
|
|
+ replication factor of the archive to increase its availability. This will
|
|
|
+ spread the load when the nodes in the cluster localize the archive for the
|
|
|
+ first time.
|
|
|
+
|
|
|
+* MapReduce Archives and Classpath Configuration
|
|
|
+
|
|
|
+ Setting a proper classpath for the MapReduce archive depends upon the
|
|
|
+ composition of the archive and whether it has any additional dependencies.
|
|
|
+ For example, the archive can contain not only the MapReduce jars but also the
|
|
|
+ necessary YARN, HDFS, and Hadoop Common jars and all other dependencies. In
|
|
|
+ that case, <<<mapreduce.application.classpath>>> would be configured to
|
|
|
+ something like the following example, where the archive basename is
|
|
|
+ hadoop-mapreduce-2.1.1.tar.gz and the archive is organized internally similar
|
|
|
+ to the standard Hadoop distribution archive:
|
|
|
+
|
|
|
+ <<<$HADOOP_CONF_DIR,$PWD/hadoop-mapreduce-2.1.1.tar.gz/hadoop-mapreduce-2.1.1/share/hadoop/mapreduce/*,$PWD/hadoop-mapreduce-2.1.1.tar.gz/hadoop-mapreduce-2.1.1/share/hadoop/mapreduce/lib/*,$PWD/hadoop-mapreduce-2.1.1.tar.gz/hadoop-mapreduce-2.1.1/share/hadoop/common/*,$PWD/hadoop-mapreduce-2.1.1.tar.gz/hadoop-mapreduce-2.1.1/share/hadoop/common/lib/*,$PWD/hadoop-mapreduce-2.1.1.tar.gz/hadoop-mapreduce-2.1.1/share/hadoop/yarn/*,$PWD/hadoop-mapreduce-2.1.1.tar.gz/hadoop-mapreduce-2.1.1/share/hadoop/yarn/lib/*,$PWD/hadoop-mapreduce-2.1.1.tar.gz/hadoop-mapreduce-2.1.1/share/hadoop/hdfs/*,$PWD/hadoop-mapreduce-2.1.1.tar.gz/hadoop-mapreduce-2.1.1/share/hadoop/hdfs/lib/*>>>
|
|
|
+
|
|
|
+ Another possible approach is to have the archive consist of just the
|
|
|
+ MapReduce jars and have the remaining dependencies picked up from the Hadoop
|
|
|
+ distribution installed on the nodes. In that case, the above example would
|
|
|
+ change to something like the following:
|
|
|
+
|
|
|
+ <<<$HADOOP_CONF_DIR,$PWD/hadoop-mapreduce-2.1.1.tar.gz/hadoop-mapreduce-2.1.1/share/hadoop/mapreduce/*,$PWD/hadoop-mapreduce-2.1.1.tar.gz/hadoop-mapreduce-2.1.1/share/hadoop/mapreduce/lib/*,$HADOOP_COMMON_HOME/share/hadoop/common/*,$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*>>>
|