|
@@ -0,0 +1,767 @@
|
|
|
+~~ Licensed under the Apache License, Version 2.0 (the "License");
|
|
|
+~~ you may not use this file except in compliance with the License.
|
|
|
+~~ You may obtain a copy of the License at
|
|
|
+~~
|
|
|
+~~ http://www.apache.org/licenses/LICENSE-2.0
|
|
|
+~~
|
|
|
+~~ Unless required by applicable law or agreed to in writing, software
|
|
|
+~~ distributed under the License is distributed on an "AS IS" BASIS,
|
|
|
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
|
+~~ See the License for the specific language governing permissions and
|
|
|
+~~ limitations under the License. See accompanying LICENSE file.
|
|
|
+
|
|
|
+ ---
|
|
|
+ Hadoop Distributed File System-${project.version} - High Availability
|
|
|
+ ---
|
|
|
+ ---
|
|
|
+ ${maven.build.timestamp}
|
|
|
+
|
|
|
+HDFS High Availability Using the Quorum Journal Manager
|
|
|
+
|
|
|
+ \[ {{{./index.html}Go Back}} \]
|
|
|
+
|
|
|
+%{toc|section=1|fromDepth=0}
|
|
|
+
|
|
|
+* {Purpose}
|
|
|
+
|
|
|
+ This guide provides an overview of the HDFS High Availability (HA) feature
|
|
|
+ and how to configure and manage an HA HDFS cluster, using the Quorum Journal
|
|
|
+ Manager (QJM) feature.
|
|
|
+
|
|
|
+ This document assumes that the reader has a general understanding of the
|
|
|
+ components and node types in an HDFS cluster. Please refer to the
|
|
|
+ HDFS Architecture guide for details.
|
|
|
+
|
|
|
+* {Note: Using the Quorum Journal Manager or Conventional Shared Storage}
|
|
|
+
|
|
|
+ This guide discusses how to configure and use HDFS HA using the Quorum
|
|
|
+ Journal Manager (QJM) to share edit logs between the Active and Standby
|
|
|
+ NameNodes. For information on how to configure HDFS HA using NFS for shared
|
|
|
+ storage instead of the QJM, please see
|
|
|
+ {{{./HDFSHighAvailabilityWithNFS.html}this alternative guide.}}
|
|
|
+
|
|
|
+* {Background}
|
|
|
+
|
|
|
+ Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in
|
|
|
+ an HDFS cluster. Each cluster had a single NameNode, and if that machine or
|
|
|
+ process became unavailable, the cluster as a whole would be unavailable
|
|
|
+ until the NameNode was either restarted or brought up on a separate machine.
|
|
|
+
|
|
|
+ This impacted the total availability of the HDFS cluster in two major ways:
|
|
|
+
|
|
|
+ * In the case of an unplanned event such as a machine crash, the cluster would
|
|
|
+ be unavailable until an operator restarted the NameNode.
|
|
|
+
|
|
|
+ * Planned maintenance events such as software or hardware upgrades on the
|
|
|
+ NameNode machine would result in windows of cluster downtime.
|
|
|
+
|
|
|
+ The HDFS High Availability feature addresses the above problems by providing
|
|
|
+ the option of running two redundant NameNodes in the same cluster in an
|
|
|
+ Active/Passive configuration with a hot standby. This allows a fast failover to
|
|
|
+ a new NameNode in the case that a machine crashes, or a graceful
|
|
|
+ administrator-initiated failover for the purpose of planned maintenance.
|
|
|
+
|
|
|
+* {Architecture}
|
|
|
+
|
|
|
+ In a typical HA cluster, two separate machines are configured as NameNodes.
|
|
|
+ At any point in time, exactly one of the NameNodes is in an <Active> state,
|
|
|
+ and the other is in a <Standby> state. The Active NameNode is responsible
|
|
|
+ for all client operations in the cluster, while the Standby is simply acting
|
|
|
+ as a slave, maintaining enough state to provide a fast failover if
|
|
|
+ necessary.
|
|
|
+
|
|
|
+ In order for the Standby node to keep its state synchronized with the Active
|
|
|
+ node, both nodes communicate with a group of separate daemons called
|
|
|
+ "JournalNodes" (JNs). When any namespace modification is performed by the
|
|
|
+ Active node, it durably logs a record of the modification to a majority of
|
|
|
+ these JNs. The Standby node is capable of reading the edits from the JNs, and
|
|
|
+ is constantly watching them for changes to the edit log. As the Standby Node
|
|
|
+ sees the edits, it applies them to its own namespace. In the event of a
|
|
|
+ failover, the Standby will ensure that it has read all of the edits from the
|
|
|
+ JournalNodes before promoting itself to the Active state. This ensures that the
|
|
|
+ namespace state is fully synchronized before a failover occurs.
|
|
|
+
|
|
|
+ In order to provide a fast failover, it is also necessary that the Standby node
|
|
|
+ have up-to-date information regarding the location of blocks in the cluster.
|
|
|
+ In order to achieve this, the DataNodes are configured with the location of
|
|
|
+ both NameNodes, and send block location information and heartbeats to both.
|
|
|
+
|
|
|
+ It is vital for the correct operation of an HA cluster that only one of the
|
|
|
+ NameNodes be Active at a time. Otherwise, the namespace state would quickly
|
|
|
+ diverge between the two, risking data loss or other incorrect results. In
|
|
|
+ order to ensure this property and prevent the so-called "split-brain scenario,"
|
|
|
+ the JournalNodes will only ever allow a single NameNode to be a writer at a
|
|
|
+ time. During a failover, the NameNode which is to become active will simply
|
|
|
+ take over the role of writing to the JournalNodes, which will effectively
|
|
|
+ prevent the other NameNode from continuing in the Active state, allowing the
|
|
|
+ new Active to safely proceed with failover.
|
|
|
+
|
|
|
+* {Hardware resources}
|
|
|
+
|
|
|
+ In order to deploy an HA cluster, you should prepare the following:
|
|
|
+
|
|
|
+ * <<NameNode machines>> - the machines on which you run the Active and
|
|
|
+ Standby NameNodes should have equivalent hardware to each other, and
|
|
|
+ equivalent hardware to what would be used in a non-HA cluster.
|
|
|
+
|
|
|
+ * <<JournalNode machines>> - the machines on which you run the JournalNodes.
|
|
|
+ The JournalNode daemon is relatively lightweight, so these daemons may
|
|
|
+ reasonably be collocated on machines with other Hadoop daemons, for example
|
|
|
+ NameNodes, the JobTracker, or the YARN ResourceManager. <<Note:>> There
|
|
|
+ must be at least 3 JournalNode daemons, since edit log modifications must be
|
|
|
+ written to a majority of JNs. This will allow the system to tolerate the
|
|
|
+ failure of a single machine. You may also run more than 3 JournalNodes, but
|
|
|
+ in order to actually increase the number of failures the system can tolerate,
|
|
|
+ you should run an odd number of JNs (i.e. 3, 5, 7, etc.). Note that when
|
|
|
+ running with N JournalNodes, the system can tolerate at most (N - 1) / 2
|
|
|
+ failures and continue to function normally.
|
|
|
+
|
|
|
+ Note that, in an HA cluster, the Standby NameNode also performs checkpoints of
|
|
|
+ the namespace state, and thus it is not necessary to run a Secondary NameNode,
|
|
|
+ CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an
|
|
|
+ error. This also allows those who are reconfiguring a non-HA-enabled HDFS cluster
|
|
|
+ to be HA-enabled to reuse the hardware which they had previously dedicated to
|
|
|
+ the Secondary NameNode.
|
|
|
+
|
|
|
+* {Deployment}
|
|
|
+
|
|
|
+** Configuration overview
|
|
|
+
|
|
|
+ Similar to Federation configuration, HA configuration is backward compatible
|
|
|
+ and allows existing single NameNode configurations to work without change.
|
|
|
+ The new configuration is designed such that all the nodes in the cluster may
|
|
|
+ have the same configuration without the need for deploying different
|
|
|
+ configuration files to different machines based on the type of the node.
|
|
|
+
|
|
|
+ Like HDFS Federation, HA clusters reuse the <<<nameservice ID>>> to identify a
|
|
|
+ single HDFS instance that may in fact consist of multiple HA NameNodes. In
|
|
|
+ addition, a new abstraction called <<<NameNode ID>>> is added with HA. Each
|
|
|
+ distinct NameNode in the cluster has a different NameNode ID to distinguish it.
|
|
|
+ To support a single configuration file for all of the NameNodes, the relevant
|
|
|
+ configuration parameters are suffixed with the <<nameservice ID>> as well as
|
|
|
+ the <<NameNode ID>>.
|
|
|
+
|
|
|
+** Configuration details
|
|
|
+
|
|
|
+ To configure HA NameNodes, you must add several configuration options to your
|
|
|
+ <<hdfs-site.xml>> configuration file.
|
|
|
+
|
|
|
+ The order in which you set these configurations is unimportant, but the values
|
|
|
+ you choose for <<dfs.nameservices>> and
|
|
|
+ <<dfs.ha.namenodes.[nameservice ID]>> will determine the keys of those that
|
|
|
+ follow. Thus, you should decide on these values before setting the rest of the
|
|
|
+ configuration options.
|
|
|
+
|
|
|
+ * <<dfs.nameservices>> - the logical name for this new nameservice
|
|
|
+
|
|
|
+ Choose a logical name for this nameservice, for example "mycluster", and use
|
|
|
+ this logical name for the value of this config option. The name you choose is
|
|
|
+ arbitrary. It will be used both for configuration and as the authority
|
|
|
+ component of absolute HDFS paths in the cluster.
|
|
|
+
|
|
|
+ <<Note:>> If you are also using HDFS Federation, this configuration setting
|
|
|
+ should also include the list of other nameservices, HA or otherwise, as a
|
|
|
+ comma-separated list.
|
|
|
+
|
|
|
+----
|
|
|
+<property>
|
|
|
+ <name>dfs.nameservices</name>
|
|
|
+ <value>mycluster</value>
|
|
|
+</property>
|
|
|
+----
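+
+ If you are also using HDFS Federation, the additional nameservices are simply
+ appended to this same value as a comma-separated list. A hedged sketch, using
+ "mycluster2" as a hypothetical second nameservice ID:
+
+----
+<property>
+ <name>dfs.nameservices</name>
+ <value>mycluster,mycluster2</value>
+</property>
+----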
|
|
|
+
|
|
|
+ * <<dfs.ha.namenodes.[nameservice ID]>> - unique identifiers for each NameNode in the nameservice
|
|
|
+
|
|
|
+ Configure with a list of comma-separated NameNode IDs. This will be used by
|
|
|
+ DataNodes to determine all the NameNodes in the cluster. For example, if you
|
|
|
+ used "mycluster" as the nameservice ID previously, and you wanted to use "nn1"
|
|
|
+ and "nn2" as the individual IDs of the NameNodes, you would configure this as
|
|
|
+ such:
|
|
|
+
|
|
|
+----
|
|
|
+<property>
|
|
|
+ <name>dfs.ha.namenodes.mycluster</name>
|
|
|
+ <value>nn1,nn2</value>
|
|
|
+</property>
|
|
|
+----
|
|
|
+
|
|
|
+ <<Note:>> Currently, only a maximum of two NameNodes may be configured per
|
|
|
+ nameservice.
|
|
|
+
|
|
|
+ * <<dfs.namenode.rpc-address.[nameservice ID].[name node ID]>> - the fully-qualified RPC address for each NameNode to listen on
|
|
|
+
|
|
|
+ For both of the previously-configured NameNode IDs, set the full address and
|
|
|
+ IPC port of the NameNode process. Note that this results in two separate
|
|
|
+ configuration options. For example:
|
|
|
+
|
|
|
+----
|
|
|
+<property>
|
|
|
+ <name>dfs.namenode.rpc-address.mycluster.nn1</name>
|
|
|
+ <value>machine1.example.com:8020</value>
|
|
|
+</property>
|
|
|
+<property>
|
|
|
+ <name>dfs.namenode.rpc-address.mycluster.nn2</name>
|
|
|
+ <value>machine2.example.com:8020</value>
|
|
|
+</property>
|
|
|
+----
|
|
|
+
|
|
|
+ <<Note:>> You may similarly configure the "<<servicerpc-address>>" setting if
|
|
|
+ you so desire.
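+
+ As a sketch only (the key follows the same suffixing pattern as <rpc-address>,
+ and the port shown here is an arbitrary example, not a default):
+
+----
+<property>
+ <name>dfs.namenode.servicerpc-address.mycluster.nn1</name>
+ <value>machine1.example.com:8040</value>
+</property>
+<property>
+ <name>dfs.namenode.servicerpc-address.mycluster.nn2</name>
+ <value>machine2.example.com:8040</value>
+</property>
+----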
|
|
|
+
|
|
|
+ * <<dfs.namenode.http-address.[nameservice ID].[name node ID]>> - the fully-qualified HTTP address for each NameNode to listen on
|
|
|
+
|
|
|
+ Similarly to <rpc-address> above, set the addresses for both NameNodes' HTTP
|
|
|
+ servers to listen on. For example:
|
|
|
+
|
|
|
+----
|
|
|
+<property>
|
|
|
+ <name>dfs.namenode.http-address.mycluster.nn1</name>
|
|
|
+ <value>machine1.example.com:50070</value>
|
|
|
+</property>
|
|
|
+<property>
|
|
|
+ <name>dfs.namenode.http-address.mycluster.nn2</name>
|
|
|
+ <value>machine2.example.com:50070</value>
|
|
|
+</property>
|
|
|
+----
|
|
|
+
|
|
|
+ <<Note:>> If you have Hadoop's security features enabled, you should also set
|
|
|
+ the <https-address> similarly for each NameNode.
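+
+ For instance, assuming the default NameNode HTTPS port of 50470, such a
+ configuration might look like:
+
+----
+<property>
+ <name>dfs.namenode.https-address.mycluster.nn1</name>
+ <value>machine1.example.com:50470</value>
+</property>
+<property>
+ <name>dfs.namenode.https-address.mycluster.nn2</name>
+ <value>machine2.example.com:50470</value>
+</property>
+----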
|
|
|
+
|
|
|
+ * <<dfs.namenode.shared.edits.dir>> - the URI which identifies the group of JNs where the NameNodes will write/read edits
|
|
|
+
|
|
|
+ This is where one configures the addresses of the JournalNodes which provide
|
|
|
+ the shared edits storage, written to by the Active NameNode and read by the
|
|
|
+ Standby NameNode to stay up-to-date with all the file system changes the Active
|
|
|
+ NameNode makes. Though you must specify several JournalNode addresses,
|
|
|
+ <<you should only configure one of these URIs.>> The URI should be of the form:
|
|
|
+ "qjournal://<host1:port1>;<host2:port2>;<host3:port3>/<journalId>". The Journal
|
|
|
+ ID is a unique identifier for this nameservice, which allows a single set of
|
|
|
+ JournalNodes to provide storage for multiple federated namesystems. Though not
|
|
|
+ a requirement, it's a good idea to reuse the nameservice ID for the journal
|
|
|
+ identifier.
|
|
|
+
|
|
|
+ For example, if the JournalNodes for this cluster were running on the
|
|
|
+ machines "node1.example.com", "node2.example.com", and "node3.example.com" and
|
|
|
+ the nameservice ID were "mycluster", you would use the following as the value
|
|
|
+ for this setting (the default port for the JournalNode is 8485):
|
|
|
+
|
|
|
+----
|
|
|
+<property>
|
|
|
+ <name>dfs.namenode.shared.edits.dir</name>
|
|
|
+ <value>qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/mycluster</value>
|
|
|
+</property>
|
|
|
+----
|
|
|
+
|
|
|
+ * <<dfs.client.failover.proxy.provider.[nameservice ID]>> - the Java class that HDFS clients use to contact the Active NameNode
|
|
|
+
|
|
|
+ Configure the name of the Java class which will be used by the DFS Client to
|
|
|
+ determine which NameNode is the current Active, and therefore which NameNode is
|
|
|
+ currently serving client requests. The only implementation which currently
|
|
|
+ ships with Hadoop is the <<ConfiguredFailoverProxyProvider>>, so use this
|
|
|
+ unless you are using a custom one. For example:
|
|
|
+
|
|
|
+----
|
|
|
+<property>
|
|
|
+ <name>dfs.client.failover.proxy.provider.mycluster</name>
|
|
|
+ <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
|
|
|
+</property>
|
|
|
+----
|
|
|
+
|
|
|
+ * <<dfs.ha.fencing.methods>> - a list of scripts or Java classes which will be used to fence the Active NameNode during a failover
|
|
|
+
|
|
|
+ It is desirable for correctness of the system that only one NameNode be in
|
|
|
+ the Active state at any given time. <<Importantly, when using the Quorum
|
|
|
+ Journal Manager, only one NameNode will ever be allowed to write to the
|
|
|
+ JournalNodes, so there is no potential for corrupting the file system metadata
|
|
|
+ from a split-brain scenario.>> However, when a failover occurs, it is still
|
|
|
+ possible that the previous Active NameNode could serve read requests to
|
|
|
+ clients, which may be out of date until that NameNode shuts down when trying to
|
|
|
+ write to the JournalNodes. For this reason, it is still desirable to configure
|
|
|
+ some fencing methods even when using the Quorum Journal Manager. However, to
|
|
|
+ improve the availability of the system in the event the fencing mechanisms
|
|
|
+ fail, it is advisable to configure a fencing method which is guaranteed to
|
|
|
+ return success as the last fencing method in the list. Note that if you choose
|
|
|
+ to use no actual fencing methods, you still must configure something for this
|
|
|
+ setting, for example "<<<shell(/bin/true)>>>".
|
|
|
+
|
|
|
+ The fencing methods used during a failover are configured as a
|
|
|
+ carriage-return-separated list, which will be attempted in order until one
|
|
|
+ indicates that fencing has succeeded. There are two methods which ship with
|
|
|
+ Hadoop: <shell> and <sshfence>. For information on implementing your own custom
|
|
|
+ fencing method, see the <org.apache.hadoop.ha.NodeFencer> class.
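+
+ For instance, to attempt <sshfence> first and fall back to a method which is
+ guaranteed to return success, the two methods are listed on separate lines
+ within the same value (a sketch only; adapt it to your environment):
+
+---
+<property>
+ <name>dfs.ha.fencing.methods</name>
+ <value>sshfence
+shell(/bin/true)</value>
+</property>
+---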
|
|
|
+
|
|
|
+ * <<sshfence>> - SSH to the Active NameNode and kill the process
|
|
|
+
|
|
|
+ The <sshfence> option SSHes to the target node and uses <fuser> to kill the
|
|
|
+ process listening on the service's TCP port. In order for this fencing option
|
|
|
+ to work, it must be able to SSH to the target node without providing a
|
|
|
+ passphrase. Thus, one must also configure the
|
|
|
+ <<dfs.ha.fencing.ssh.private-key-files>> option, which is a
|
|
|
+ comma-separated list of SSH private key files. For example:
|
|
|
+
|
|
|
+---
|
|
|
+<property>
|
|
|
+ <name>dfs.ha.fencing.methods</name>
|
|
|
+ <value>sshfence</value>
|
|
|
+</property>
|
|
|
+
|
|
|
+<property>
|
|
|
+ <name>dfs.ha.fencing.ssh.private-key-files</name>
|
|
|
+ <value>/home/exampleuser/.ssh/id_rsa</value>
|
|
|
+</property>
|
|
|
+---
|
|
|
+
|
|
|
+ Optionally, one may configure a non-standard username or port to perform the
|
|
|
+ SSH. One may also configure a timeout, in milliseconds, for the SSH, after
|
|
|
+ which this fencing method will be considered to have failed. It may be
|
|
|
+ configured like so:
|
|
|
+
|
|
|
+---
|
|
|
+<property>
|
|
|
+ <name>dfs.ha.fencing.methods</name>
|
|
|
+ <value>sshfence([[username][:port]])</value>
|
|
|
+</property>
|
|
|
+<property>
|
|
|
+ <name>dfs.ha.fencing.ssh.connect-timeout</name>
|
|
|
+ <value>30000</value>
|
|
|
+</property>
|
|
|
+---
|
|
|
+
|
|
|
+ * <<shell>> - run an arbitrary shell command to fence the Active NameNode
|
|
|
+
|
|
|
+ The <shell> fencing method runs an arbitrary shell command. It may be
|
|
|
+ configured like so:
|
|
|
+
|
|
|
+---
|
|
|
+<property>
|
|
|
+ <name>dfs.ha.fencing.methods</name>
|
|
|
+ <value>shell(/path/to/my/script.sh arg1 arg2 ...)</value>
|
|
|
+</property>
|
|
|
+---
|
|
|
+
|
|
|
+ The string between '(' and ')' is passed directly to a bash shell and may not
|
|
|
+ include any closing parentheses.
|
|
|
+
|
|
|
+ The shell command will be run with an environment set up to contain all of the
|
|
|
+ current Hadoop configuration variables, with the '_' character replacing any
|
|
|
+ '.' characters in the configuration keys. The configuration used has already had
|
|
|
+ any namenode-specific configurations promoted to their generic forms -- for example
|
|
|
+ <<dfs_namenode_rpc-address>> will contain the RPC address of the target node, even
|
|
|
+ though the configuration may specify that variable as
|
|
|
+ <<dfs.namenode.rpc-address.ns1.nn1>>.
|
|
|
+
|
|
|
+ Additionally, the following variables referring to the target node to be fenced
|
|
|
+ are also available:
|
|
|
+
|
|
|
+*-----------------------:-----------------------------------+
|
|
|
+| $target_host | hostname of the node to be fenced |
|
|
|
+*-----------------------:-----------------------------------+
|
|
|
+| $target_port | IPC port of the node to be fenced |
|
|
|
+*-----------------------:-----------------------------------+
|
|
|
+| $target_address | the above two, combined as host:port |
|
|
|
+*-----------------------:-----------------------------------+
|
|
|
+| $target_nameserviceid | the nameservice ID of the NN to be fenced |
|
|
|
+*-----------------------:-----------------------------------+
|
|
|
+| $target_namenodeid | the namenode ID of the NN to be fenced |
|
|
|
+*-----------------------:-----------------------------------+
|
|
|
+
|
|
|
+ These environment variables may also be used as substitutions in the shell
|
|
|
+ command itself. For example:
|
|
|
+
|
|
|
+---
|
|
|
+<property>
|
|
|
+ <name>dfs.ha.fencing.methods</name>
|
|
|
+ <value>shell(/path/to/my/script.sh --nameservice=$target_nameserviceid $target_host:$target_port)</value>
|
|
|
+</property>
|
|
|
+---
|
|
|
+
|
|
|
+ If the shell command returns an exit
|
|
|
+ code of 0, the fencing is determined to be successful. If it returns any other
|
|
|
+ exit code, the fencing was not successful and the next fencing method in the
|
|
|
+ list will be attempted.
|
|
|
+
|
|
|
+ <<Note:>> This fencing method does not implement any timeout. If timeouts are
|
|
|
+ necessary, they should be implemented in the shell script itself (e.g. by forking
|
|
|
+ a subshell to kill its parent in some number of seconds).
|
|
|
+
|
|
|
+ * <<fs.defaultFS>> - the default path prefix used by the Hadoop FS client when none is given
|
|
|
+
|
|
|
+ Optionally, you may now configure the default path for Hadoop clients to use
|
|
|
+ the new HA-enabled logical URI. If you used "mycluster" as the nameservice ID
|
|
|
+ earlier, this will be the value of the authority portion of all of your HDFS
|
|
|
+ paths. This may be configured like so, in your <<core-site.xml>> file:
|
|
|
+
|
|
|
+---
|
|
|
+<property>
|
|
|
+ <name>fs.defaultFS</name>
|
|
|
+ <value>hdfs://mycluster</value>
|
|
|
+</property>
|
|
|
+---
|
|
|
+
|
|
|
+
|
|
|
+ * <<dfs.journalnode.edits.dir>> - the path where the JournalNode daemon will store its local state
|
|
|
+
|
|
|
+ This is the absolute path on the JournalNode machines where the edits and
|
|
|
+ other local state used by the JNs will be stored. You may only use a single
|
|
|
+ path for this configuration. Redundancy for this data is provided by running
|
|
|
+ multiple separate JournalNodes, or by configuring this directory on a
|
|
|
+ locally-attached RAID array. For example:
|
|
|
+
|
|
|
+---
|
|
|
+<property>
|
|
|
+ <name>dfs.journalnode.edits.dir</name>
|
|
|
+ <value>/path/to/journal/node/local/data</value>
|
|
|
+</property>
|
|
|
+---
|
|
|
+
|
|
|
+** Deployment details
|
|
|
+
|
|
|
+ After all of the necessary configuration options have been set, you must
|
|
|
+ start the JournalNode daemons on the set of machines where they will run. This
|
|
|
+ can be done by running the command "<hadoop-daemon.sh start journalnode>" and waiting
|
|
|
+ for the daemon to start on each of the relevant machines.
|
|
|
+
|
|
|
+ Once the JournalNodes have been started, one must initially synchronize the
|
|
|
+ two HA NameNodes' on-disk metadata.
|
|
|
+
|
|
|
+ * If you are setting up a fresh HDFS cluster, you should first run the format
|
|
|
+ command (<hdfs namenode -format>) on one of the NameNodes.
|
|
|
+
|
|
|
+ * If you have already formatted the NameNode, or are converting a
|
|
|
+ non-HA-enabled cluster to be HA-enabled, you should now copy over the
|
|
|
+ contents of your NameNode metadata directories to the other, unformatted
|
|
|
+ NameNode by running the command "<hdfs namenode -bootstrapStandby>" on the
|
|
|
+ unformatted NameNode. Running this command will also ensure that the
|
|
|
+ JournalNodes (as configured by <<dfs.namenode.shared.edits.dir>>) contain
|
|
|
+ sufficient edits transactions to be able to start both NameNodes.
|
|
|
+
|
|
|
+ * If you are converting a non-HA NameNode to be HA, you should run the
|
|
|
+ command "<hdfs -initializeSharedEdits>", which will initialize the
|
|
|
+ JournalNodes with the edits data from the local NameNode edits directories.
|
|
|
+
|
|
|
+ At this point you may start both of your HA NameNodes as you normally would
|
|
|
+ start a NameNode.
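+
+ As a rough sketch, the initialization steps above might look like the
+ following for a fresh cluster (commands are run on the hosts indicated in the
+ comments; a conversion from a non-HA cluster would use
+ <hdfs namenode -initializeSharedEdits> instead of formatting):
+
+---
+# On the first NameNode (for example, nn1): format a fresh namespace
+$ hdfs namenode -format
+
+# On the second, still unformatted NameNode (nn2): copy the metadata over
+$ hdfs namenode -bootstrapStandby
+---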
|
|
|
+
|
|
|
+ You can visit each of the NameNodes' web pages separately by browsing to their
|
|
|
+ configured HTTP addresses. You should notice that next to the configured
|
|
|
+ address will be the HA state of the NameNode (either "standby" or "active").
|
|
|
+ Whenever an HA NameNode starts, it is initially in the Standby state.
|
|
|
+
|
|
|
+** Administrative commands
|
|
|
+
|
|
|
+ Now that your HA NameNodes are configured and started, you will have access
|
|
|
+ to some additional commands to administer your HA HDFS cluster. Specifically,
|
|
|
+ you should familiarize yourself with all of the subcommands of the "<hdfs
|
|
|
+ haadmin>" command. Running this command without any additional arguments will
|
|
|
+ display the following usage information:
|
|
|
+
|
|
|
+---
|
|
|
+Usage: DFSHAAdmin [-ns <nameserviceId>]
|
|
|
+ [-transitionToActive <serviceId>]
|
|
|
+ [-transitionToStandby <serviceId>]
|
|
|
+ [-failover [--forcefence] [--forceactive] <serviceId> <serviceId>]
|
|
|
+ [-getServiceState <serviceId>]
|
|
|
+ [-checkHealth <serviceId>]
|
|
|
+ [-help <command>]
|
|
|
+---
|
|
|
+
|
|
|
+ This guide describes high-level uses of each of these subcommands. For
|
|
|
+ specific usage information of each subcommand, you should run "<hdfs haadmin
|
|
|
+ -help <command>>".
|
|
|
+
|
|
|
+ * <<transitionToActive>> and <<transitionToStandby>> - transition the state of the given NameNode to Active or Standby
|
|
|
+
|
|
|
+ These subcommands cause a given NameNode to transition to the Active or Standby
|
|
|
+ state, respectively. <<These commands do not attempt to perform any fencing,
|
|
|
+ and thus should rarely be used.>> Instead, one should almost always prefer to
|
|
|
+ use the "<hdfs haadmin -failover>" subcommand.
|
|
|
+
|
|
|
+ * <<failover>> - initiate a failover between two NameNodes
|
|
|
+
|
|
|
+ This subcommand causes a failover from the first provided NameNode to the
|
|
|
+ second. If the first NameNode is in the Standby state, this command simply
|
|
|
+ transitions the second to the Active state without error. If the first NameNode
|
|
|
+ is in the Active state, an attempt will be made to gracefully transition it to
|
|
|
+ the Standby state. If this fails, the fencing methods (as configured by
|
|
|
+ <<dfs.ha.fencing.methods>>) will be attempted in order until one
|
|
|
+ succeeds. Only after this process will the second NameNode be transitioned to
|
|
|
+ the Active state. If no fencing method succeeds, the second NameNode will not
|
|
|
+ be transitioned to the Active state, and an error will be returned.
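+
+ For example, to fail over from the NameNode with ID "nn1" to the NameNode
+ with ID "nn2" (the IDs configured earlier in this guide), one might run:
+
+---
+$ hdfs haadmin -failover nn1 nn2
+---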
|
|
|
+
|
|
|
+ * <<getServiceState>> - determine whether the given NameNode is Active or Standby
|
|
|
+
|
|
|
+ Connect to the provided NameNode to determine its current state, printing
|
|
|
+ either "standby" or "active" to STDOUT appropriately. This subcommand might be
|
|
|
+ used by cron jobs or monitoring scripts which need to behave differently based
|
|
|
+ on whether the NameNode is currently Active or Standby.
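+
+ For example, a monitoring script might check the state of "nn1" like so,
+ reading the single word printed to STDOUT:
+
+---
+$ hdfs haadmin -getServiceState nn1
+active
+---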
|
|
|
+
|
|
|
+ * <<checkHealth>> - check the health of the given NameNode
|
|
|
+
|
|
|
+ Connect to the provided NameNode to check its health. The NameNode is capable
|
|
|
+ of performing some diagnostics on itself, including checking if internal
|
|
|
+ services are running as expected. This command will return 0 if the NameNode is
|
|
|
+ healthy, non-zero otherwise. One might use this command for monitoring
|
|
|
+ purposes.
|
|
|
+
|
|
|
+ <<Note:>> This is not yet implemented, and at present will always return
|
|
|
+ success, unless the given NameNode is completely down.
|
|
|
+
|
|
|
+* {Automatic Failover}
|
|
|
+
|
|
|
+** Introduction
|
|
|
+
|
|
|
+ The above sections describe how to configure manual failover. In that mode,
|
|
|
+ the system will not automatically trigger a failover from the active to the
|
|
|
+ standby NameNode, even if the active node has failed. This section describes
|
|
|
+ how to configure and deploy automatic failover.
|
|
|
+
|
|
|
+** Components
|
|
|
+
|
|
|
+ Automatic failover adds two new components to an HDFS deployment: a ZooKeeper
|
|
|
+ quorum, and the ZKFailoverController process (abbreviated as ZKFC).
|
|
|
+
|
|
|
+ Apache ZooKeeper is a highly available service for maintaining small amounts
|
|
|
+ of coordination data, notifying clients of changes in that data, and
|
|
|
+ monitoring clients for failures. The implementation of automatic HDFS failover
|
|
|
+ relies on ZooKeeper for the following things:
|
|
|
+
|
|
|
+ * <<Failure detection>> - each of the NameNode machines in the cluster
|
|
|
+ maintains a persistent session in ZooKeeper. If the machine crashes, the
|
|
|
+ ZooKeeper session will expire, notifying the other NameNode that a failover
|
|
|
+ should be triggered.
|
|
|
+
|
|
|
+ * <<Active NameNode election>> - ZooKeeper provides a simple mechanism to
|
|
|
+ exclusively elect a node as active. If the current active NameNode crashes,
|
|
|
+ another node may take a special exclusive lock in ZooKeeper indicating that
|
|
|
+ it should become the next active.
|
|
|
+
|
|
|
+ The ZKFailoverController (ZKFC) is a new component: a ZooKeeper client which
|
|
|
+ also monitors and manages the state of the NameNode. Each of the
|
|
|
+ machines which run a NameNode also runs a ZKFC, and that ZKFC is responsible
|
|
|
+ for:
|
|
|
+
|
|
|
+ * <<Health monitoring>> - the ZKFC pings its local NameNode on a periodic
|
|
|
+ basis with a health-check command. So long as the NameNode responds in a
|
|
|
+ timely fashion with a healthy status, the ZKFC considers the node
|
|
|
+ healthy. If the node has crashed, frozen, or otherwise entered an unhealthy
|
|
|
+ state, the health monitor will mark it as unhealthy.
|
|
|
+
|
|
|
+ * <<ZooKeeper session management>> - when the local NameNode is healthy, the
|
|
|
+ ZKFC holds a session open in ZooKeeper. If the local NameNode is active, it
|
|
|
+ also holds a special "lock" znode. This lock uses ZooKeeper's support for
|
|
|
+ "ephemeral" nodes; if the session expires, the lock node will be
|
|
|
+ automatically deleted.
|
|
|
+
|
|
|
+ * <<ZooKeeper-based election>> - if the local NameNode is healthy, and the
|
|
|
+ ZKFC sees that no other node currently holds the lock znode, it will itself
|
|
|
+ try to acquire the lock. If it succeeds, then it has "won the election", and
|
|
|
+ is responsible for running a failover to make its local NameNode active. The
|
|
|
+ failover process is similar to the manual failover described above: first,
|
|
|
+ the previous active is fenced if necessary, and then the local NameNode
|
|
|
+ transitions to active state.
|
|
|
+
|
|
|
+ For more details on the design of automatic failover, refer to the design
|
|
|
+ document attached to HDFS-2185 on the Apache HDFS JIRA.
|
|
|
+
|
|
|
+** Deploying ZooKeeper
|
|
|
+
|
|
|
+ In a typical deployment, ZooKeeper daemons are configured to run on three or
|
|
|
+ five nodes. Since ZooKeeper itself has light resource requirements, it is
|
|
|
+ acceptable to collocate the ZooKeeper nodes on the same hardware as the HDFS
|
|
|
+ NameNode and Standby Node. Many operators choose to deploy the third ZooKeeper
|
|
|
+ process on the same node as the YARN ResourceManager. It is advisable to
|
|
|
+ configure the ZooKeeper nodes to store their data on separate disk drives from
|
|
|
+ the HDFS metadata for best performance and isolation.
|
|
|
+
|
|
|
+ The setup of ZooKeeper is out of scope for this document. We will assume that
|
|
|
+ you have set up a ZooKeeper cluster running on three or more nodes, and have
|
|
|
+ verified its correct operation by connecting using the ZK CLI.
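+
+ For example, assuming ZooKeeper's command line client is available as
+ <<<zkCli.sh>>> under your ZooKeeper installation's bin directory, a quick
+ sanity check might look like:
+
+----
+$ $ZK_HOME/bin/zkCli.sh -server zk1.example.com:2181
+[zk: zk1.example.com:2181(CONNECTED) 0] ls /
+----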
|
|
|
+
|
|
|
+** Before you begin
|
|
|
+
|
|
|
+ Before you begin configuring automatic failover, you should shut down your
|
|
|
+ cluster. It is not currently possible to transition from a manual failover
|
|
|
+ setup to an automatic failover setup while the cluster is running.
|
|
|
+
|
|
|
+** Configuring automatic failover
|
|
|
+
|
|
|
+ The configuration of automatic failover requires the addition of two new
|
|
|
+ parameters to your configuration. In your <<<hdfs-site.xml>>> file, add:
|
|
|
+
|
|
|
+----
|
|
|
+ <property>
|
|
|
+ <name>dfs.ha.automatic-failover.enabled</name>
|
|
|
+ <value>true</value>
|
|
|
+ </property>
|
|
|
+----
|
|
|
+
|
|
|
+ This specifies that the cluster should be set up for automatic failover.
|
|
|
+ In your <<<core-site.xml>>> file, add:
|
|
|
+
|
|
|
+----
|
|
|
+ <property>
|
|
|
+ <name>ha.zookeeper.quorum</name>
|
|
|
+ <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
|
|
|
+ </property>
|
|
|
+----
|
|
|
+
|
|
|
+ This lists the host-port pairs running the ZooKeeper service.
|
|
|
+
|
|
|
+ As with the parameters described earlier in the document, these settings may
|
|
|
+ be configured on a per-nameservice basis by suffixing the configuration key
|
|
|
+ with the nameservice ID. For example, in a cluster with federation enabled,
|
|
|
+ you can explicitly enable automatic failover for only one of the nameservices
|
|
|
+ by setting <<<dfs.ha.automatic-failover.enabled.my-nameservice-id>>>.
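+
+ For example, to enable automatic failover only for the "mycluster"
+ nameservice used earlier in this guide, one might add:
+
+----
+ <property>
+  <name>dfs.ha.automatic-failover.enabled.mycluster</name>
+  <value>true</value>
+ </property>
+----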
|
|
|
+
|
|
|
+ There are also several other configuration parameters which may be set to
|
|
|
+ control the behavior of automatic failover; however, they are not necessary
|
|
|
+ for most installations. Please refer to the configuration key specific
|
|
|
+ documentation for details.
|
|
|
+
|
|
|
+** Initializing HA state in ZooKeeper
|
|
|
+
|
|
|
+ After the configuration keys have been added, the next step is to initialize
|
|
|
+ required state in ZooKeeper. You can do so by running the following command
|
|
|
+ from one of the NameNode hosts.
|
|
|
+
|
|
|
+----
|
|
|
+$ hdfs zkfc -formatZK
|
|
|
+----
|
|
|
+
|
|
|
+ This will create a znode in ZooKeeper inside of which the automatic failover
|
|
|
+ system stores its data.
|
|
|
+
|
|
|
+** Starting the cluster with <<<start-dfs.sh>>>
|
|
|
+
|
|
|
+ Since automatic failover has been enabled in the configuration, the
|
|
|
+ <<<start-dfs.sh>>> script will now automatically start a ZKFC daemon on any
|
|
|
+ machine that runs a NameNode. When the ZKFCs start, they will automatically
|
|
|
+ select one of the NameNodes to become active.
|
|
|
+
|
|
|
+** Starting the cluster manually
|
|
|
+
|
|
|
+ If you manually manage the services on your cluster, you will need to manually
|
|
|
+ start the <<<zkfc>>> daemon on each of the machines that runs a NameNode. You
|
|
|
+ can start the daemon by running:
|
|
|
+
|
|
|
+----
|
|
|
+$ hadoop-daemon.sh start zkfc
|
|
|
+----
|
|
|
+
|
|
|
+** Securing access to ZooKeeper
|
|
|
+
|
|
|
+ If you are running a secure cluster, you will likely want to ensure that the
|
|
|
+ information stored in ZooKeeper is also secured. This prevents malicious
|
|
|
+ clients from modifying the metadata in ZooKeeper or potentially triggering a
|
|
|
+ false failover.
|
|
|
+
|
|
|
+ In order to secure the information in ZooKeeper, first add the following to
|
|
|
+ your <<<core-site.xml>>> file:
|
|
|
+
|
|
|
+----
|
|
|
+ <property>
|
|
|
+ <name>ha.zookeeper.auth</name>
|
|
|
+ <value>@/path/to/zk-auth.txt</value>
|
|
|
+ </property>
|
|
|
+ <property>
|
|
|
+ <name>ha.zookeeper.acl</name>
|
|
|
+ <value>@/path/to/zk-acl.txt</value>
|
|
|
+ </property>
|
|
|
+----
|
|
|
+
|
|
|
+ Please note the '@' character in these values -- this specifies that the
|
|
|
+ configurations are not inline, but rather point to a file on disk.
|
|
|
+
|
|
|
+ The first configured file specifies a list of ZooKeeper authentications, in
|
|
|
+ the same format as used by the ZK CLI. For example, you may specify something
|
|
|
+ like:
|
|
|
+
|
|
|
+----
|
|
|
+digest:hdfs-zkfcs:mypassword
|
|
|
+----
|
|
|
+ ...where <<<hdfs-zkfcs>>> is a unique username for ZooKeeper, and
|
|
|
+ <<<mypassword>>> is some unique string used as a password.
|
|
|
+
|
|
|
+ Next, generate a ZooKeeper ACL that corresponds to this authentication, using
|
|
|
+ a command like the following:
|
|
|
+
|
|
|
+----
|
|
|
+$ java -cp $ZK_HOME/lib/*:$ZK_HOME/zookeeper-3.4.2.jar org.apache.zookeeper.server.auth.DigestAuthenticationProvider hdfs-zkfcs:mypassword
|
|
|
+output: hdfs-zkfcs:mypassword->hdfs-zkfcs:P/OQvnYyU/nF/mGYvB/xurX8dYs=
|
|
|
+----
|
|
|
+
|
|
|
+ Copy and paste the section of this output after the '->' string into the file
|
|
|
+ <<<zk-acl.txt>>>, prefixed by the string "<<<digest:>>>". For example:
|
|
|
+
|
|
|
+----
|
|
|
+digest:hdfs-zkfcs:vlUvLnd8MlacsE80rDuu6ONESbM=:rwcda
|
|
|
+----
|
|
|
+
|
|
|
+ In order for these ACLs to take effect, you should then rerun the
|
|
|
+ <<<zkfc -formatZK>>> command as described above.
|
|
|
+
|
|
|
+ After doing so, you may verify the ACLs from the ZK CLI as follows:
|
|
|
+
|
|
|
+----
|
|
|
+[zk: localhost:2181(CONNECTED) 1] getAcl /hadoop-ha
|
|
|
+'digest,'hdfs-zkfcs:vlUvLnd8MlacsE80rDuu6ONESbM=
|
|
|
+: cdrwa
|
|
|
+----
|
|
|
+
|
|
|
+** Verifying automatic failover
|
|
|
+
|
|
|
+ Once automatic failover has been set up, you should test its operation. To do
|
|
|
+ so, first locate the active NameNode. You can tell which node is active by
|
|
|
+ visiting the NameNode web interfaces -- each node reports its HA state at the
|
|
|
+ top of the page.
|
|
|
+
|
|
|
+ Once you have located your active NameNode, you may cause a failure on that
|
|
|
+ node. For example, you can use <<<kill -9 <pid of NN>>>> to simulate a JVM
|
|
|
+ crash. Or, you could power cycle the machine or unplug its network interface
|
|
|
+ to simulate a different kind of outage. After triggering the outage you wish
|
|
|
+ to test, the other NameNode should automatically become active within several
|
|
|
+ seconds. The amount of time required to detect a failure and trigger a
|
|
|
+ fail-over depends on the configuration of
|
|
|
+ <<<ha.zookeeper.session-timeout.ms>>>, but defaults to 5 seconds.
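+
+ As a hedged sketch, one way to simulate the crash and then confirm the
+ result, assuming the NameNode IDs "nn1" and "nn2" from earlier in this guide:
+
+----
+# On the currently active NameNode host: find the NameNode JVM and kill it
+$ jps | grep NameNode
+$ kill -9 <pid of NN>
+
+# After a few seconds, the other NameNode should report itself as active
+$ hdfs haadmin -getServiceState nn2
+----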
|
|
|
+
|
|
|
+ If the test does not succeed, you may have a misconfiguration. Check the logs
|
|
|
+ for the <<<zkfc>>> daemons as well as the NameNode daemons in order to further
|
|
|
+ diagnose the issue.
|
|
|
+
|
|
|
+
|
|
|
+* Automatic Failover FAQ
|
|
|
+
|
|
|
+ * <<Is it important that I start the ZKFC and NameNode daemons in any
|
|
|
+ particular order?>>
|
|
|
+
|
|
|
+ No. On any given node you may start the ZKFC before or after its corresponding
|
|
|
+ NameNode.
|
|
|
+
|
|
|
+ * <<What additional monitoring should I put in place?>>
|
|
|
+
|
|
|
+ You should add monitoring on each host that runs a NameNode to ensure that the
|
|
|
+ ZKFC remains running. In some types of ZooKeeper failures, for example, the
|
|
|
+ ZKFC may unexpectedly exit, and should be restarted to ensure that the system
|
|
|
+ is ready for automatic failover.
|
|
|
+
|
|
|
+ Additionally, you should monitor each of the servers in the ZooKeeper
|
|
|
+ quorum. If ZooKeeper crashes, then automatic failover will not function.
|
|
|
+
|
|
|
+ * <<What happens if ZooKeeper goes down?>>
|
|
|
+
|
|
|
+ If the ZooKeeper cluster crashes, no automatic failovers will be triggered.
|
|
|
+ However, HDFS will continue to run without any impact. When ZooKeeper is
|
|
|
+ restarted, HDFS will reconnect with no issues.
|
|
|
+
|
|
|
+ * <<Can I designate one of my NameNodes as primary/preferred?>>
|
|
|
+
|
|
|
+ No. Currently, this is not supported. Whichever NameNode is started first will
|
|
|
+ become active. You may choose to start the cluster in a specific order such
|
|
|
+ that your preferred node starts first.
|
|
|
+
|
|
|
+ * <<How can I initiate a manual failover when automatic failover is
|
|
|
+ configured?>>
|
|
|
+
|
|
|
+ Even if automatic failover is configured, you may initiate a manual failover
|
|
|
+ using the same <<<hdfs haadmin>>> command. It will perform a coordinated
|
|
|
+ failover.
|