|
@@ -0,0 +1,86 @@
|
|
|
+~~ Licensed under the Apache License, Version 2.0 (the "License");
|
|
|
+~~ you may not use this file except in compliance with the License.
|
|
|
+~~ You may obtain a copy of the License at
|
|
|
+~~
|
|
|
+~~ http://www.apache.org/licenses/LICENSE-2.0
|
|
|
+~~
|
|
|
+~~ Unless required by applicable law or agreed to in writing, software
|
|
|
+~~ distributed under the License is distributed on an "AS IS" BASIS,
|
|
|
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
|
+~~ See the License for the specific language governing permissions and
|
|
|
+~~ limitations under the License. See accompanying LICENSE file.
|
|
|
+
|
|
|
+ ---
|
|
|
+ NodeManager Restart
|
|
|
+ ---
|
|
|
+ ---
|
|
|
+ ${maven.build.timestamp}
|
|
|
+
|
|
|
+NodeManager Restart
|
|
|
+
|
|
|
+* Introduction
|
|
|
+
|
|
|
+ This document gives an overview of NodeManager (NM) restart, a feature that
|
|
|
+ enables the NodeManager to be restarted without losing
|
|
|
+ the active containers running on the node. At a high level, the NM stores any
|
|
|
+ necessary state to a local state-store as it processes container-management
|
|
|
+ requests. When the NM restarts, it recovers by first loading state for
|
|
|
+ various subsystems and then letting those subsystems perform recovery using
|
|
|
+ the loaded state.
|
|
|
+
|
|
|
+* Enabling NM Restart
|
|
|
+
|
|
|
+ [[1]] To enable NM Restart functionality, set the following property in <<conf/yarn-site.xml>> to true:
|
|
|
+
|
|
|
+*--------------------------------------+--------------------------------------+
|
|
|
+|| Property || Value |
|
|
|
+*--------------------------------------+--------------------------------------+
|
|
|
+| <<<yarn.nodemanager.recovery.enabled>>> | |
|
|
|
+| | <<<true>>>, (default value is set to false) |
|
|
|
+*--------------------------------------+--------------------------------------+
|
|
|
+
|
|
|
+ [[2]] Configure a path to the local file-system directory where the
|
|
|
+ NodeManager can save its run state
|
|
|
+
|
|
|
+*--------------------------------------+--------------------------------------+
|
|
|
+|| Property || Description |
|
|
|
+*--------------------------------------+--------------------------------------+
|
|
|
+| <<<yarn.nodemanager.recovery.dir>>> | |
|
|
|
+| | The local filesystem directory in which the node manager will store state |
|
|
|
+| | when recovery is enabled. |
|
|
|
+| | The default value is set to |
|
|
|
+| | <<<${hadoop.tmp.dir}/yarn-nm-recovery>>>. |
|
|
|
+*--------------------------------------+--------------------------------------+
|
|
|
+
|
|
|
+ [[3]] Configure a valid RPC address for the NodeManager
|
|
|
+
|
|
|
+*--------------------------------------+--------------------------------------+
|
|
|
+|| Property || Description |
|
|
|
+*--------------------------------------+--------------------------------------+
|
|
|
+| <<<yarn.nodemanager.address>>> | |
|
|
|
+| | Ephemeral ports (port 0, which is default) cannot be used for the |
|
|
|
+| | NodeManager's RPC server specified via yarn.nodemanager.address as it can |
|
|
|
+| | make NM use different ports before and after a restart. This will break any |
|
|
|
+| | previously running clients that were communicating with the NM before |
|
|
|
+| | restart. Explicitly setting yarn.nodemanager.address to an address with |
|
|
|
+| | specific port number (for e.g 0.0.0.0:45454) is a precondition for enabling |
|
|
|
+| | NM restart. |
|
|
|
+*--------------------------------------+--------------------------------------+
|
|
|
+
|
|
|
+ [[4]] Auxiliary services
|
|
|
+
|
|
|
+ NodeManagers in a YARN cluster can be configured to run auxiliary services.
|
|
|
+ For a completely functional NM restart, YARN relies on any auxiliary service
|
|
|
+ configured to also support recovery. This usually includes (1) avoiding usage
|
|
|
+ of ephemeral ports so that previously running clients (in this case, usually
|
|
|
+ containers) are not disrupted after restart and (2) having the auxiliary
|
|
|
+ service itself support recoverability by reloading any previous state when
|
|
|
+ NodeManager restarts and reinitializes the auxiliary service.
|
|
|
+
|
|
|
+ A simple example for the above is the auxiliary service 'ShuffleHandler' for
|
|
|
+ MapReduce (MR). ShuffleHandler respects the above two requirements already,
|
|
|
+ so users/admins don't have do anything for it to support NM restart: (1) The
|
|
|
+ configuration property <<mapreduce.shuffle.port>> controls which port the
|
|
|
+ ShuffleHandler on a NodeManager host binds to, and it defaults to a
|
|
|
+ non-ephemeral port. (2) The ShuffleHandler service also already supports
|
|
|
+ recovery of previous state after NM restarts.
|