
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~   http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.

  ---
  Hadoop Distributed File System-${project.version} - High Availability
  ---
  ---
  ${maven.build.timestamp}

HDFS High Availability

%{toc|section=1|fromDepth=0}
* {Purpose}

  This guide provides an overview of the HDFS High Availability (HA) feature
  and how to configure and manage an HA HDFS cluster, using NFS for the
  shared storage required by the NameNodes.

  This document assumes that the reader has a general understanding of the
  components and node types in an HDFS cluster. Please refer to the HDFS
  Architecture guide for details.

* {Note: Using the Quorum Journal Manager or Conventional Shared Storage}

  This guide discusses how to configure and use HDFS HA using a shared NFS
  directory to share edit logs between the Active and Standby NameNodes. For
  information on how to configure HDFS HA using the Quorum Journal Manager
  instead of NFS, please see {{{./HDFSHighAvailabilityWithQJM.html}this
  alternative guide.}}
* {Background}

  Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in
  an HDFS cluster. Each cluster had a single NameNode, and if that machine or
  process became unavailable, the cluster as a whole would be unavailable
  until the NameNode was either restarted or brought up on a separate
  machine.

  This impacted the total availability of the HDFS cluster in two major ways:

  * In the case of an unplanned event such as a machine crash, the cluster
    would be unavailable until an operator restarted the NameNode.

  * Planned maintenance events such as software or hardware upgrades on the
    NameNode machine would result in windows of cluster downtime.

  The HDFS High Availability feature addresses the above problems by
  providing the option of running two redundant NameNodes in the same cluster
  in an Active/Passive configuration with a hot standby. This allows a fast
  failover to a new NameNode in the case that a machine crashes, or a
  graceful administrator-initiated failover for the purpose of planned
  maintenance.
* {Architecture}

  In a typical HA cluster, two separate machines are configured as NameNodes.
  At any point in time, exactly one of the NameNodes is in an <Active> state,
  and the other is in a <Standby> state. The Active NameNode is responsible
  for all client operations in the cluster, while the Standby is simply
  acting as a slave, maintaining enough state to provide a fast failover if
  necessary.

  In order for the Standby node to keep its state synchronized with the
  Active node, the current implementation requires that the two nodes both
  have access to a directory on a shared storage device (e.g. an NFS mount
  from a NAS). This restriction will likely be relaxed in future versions.

  When any namespace modification is performed by the Active node, it durably
  logs a record of the modification to an edit log file stored in the shared
  directory. The Standby node is constantly watching this directory for
  edits, and as it sees the edits, it applies them to its own namespace. In
  the event of a failover, the Standby will ensure that it has read all of
  the edits from the shared storage before promoting itself to the Active
  state. This ensures that the namespace state is fully synchronized before a
  failover occurs.

  In order to provide a fast failover, it is also necessary that the Standby
  node have up-to-date information regarding the location of blocks in the
  cluster. In order to achieve this, the DataNodes are configured with the
  location of both NameNodes, and send block location information and
  heartbeats to both.

  It is vital for the correct operation of an HA cluster that only one of the
  NameNodes be Active at a time. Otherwise, the namespace state would quickly
  diverge between the two, risking data loss or other incorrect results. In
  order to ensure this property and prevent the so-called "split-brain
  scenario," the administrator must configure at least one <fencing method>
  for the shared storage. During a failover, if it cannot be verified that
  the previous Active node has relinquished its Active state, the fencing
  process is responsible for cutting off the previous Active's access to the
  shared edits storage. This prevents it from making any further edits to the
  namespace, allowing the new Active to safely proceed with failover.
* {Hardware resources}

  In order to deploy an HA cluster, you should prepare the following:

  * <<NameNode machines>> - the machines on which you run the Active and
    Standby NameNodes should have equivalent hardware to each other, and
    equivalent hardware to what would be used in a non-HA cluster.

  * <<Shared storage>> - you will need to have a shared directory which both
    NameNode machines can have read/write access to. Typically this is a
    remote filer which supports NFS and is mounted on each of the NameNode
    machines. Currently only a single shared edits directory is supported.
    Thus, the availability of the system is limited by the availability of
    this shared edits directory, and therefore in order to remove all single
    points of failure there needs to be redundancy for the shared edits
    directory. Specifically, there should be multiple network paths to the
    storage, and redundancy in the storage itself (disk, network, and power).
    Because of this, it is recommended that the shared storage server be a
    high-quality dedicated NAS appliance rather than a simple Linux server.

  Note that, in an HA cluster, the Standby NameNode also performs checkpoints
  of the namespace state, and thus it is not necessary to run a Secondary
  NameNode, CheckpointNode, or BackupNode in an HA cluster. In fact, to do so
  would be an error. This also allows one who is reconfiguring a
  non-HA-enabled HDFS cluster to be HA-enabled to reuse the hardware which
  they had previously dedicated to the Secondary NameNode.
* {Deployment}

** Configuration overview

  Similar to Federation configuration, HA configuration is backward
  compatible and allows existing single NameNode configurations to work
  without change. The new configuration is designed such that all the nodes
  in the cluster may have the same configuration without the need for
  deploying different configuration files to different machines based on the
  type of the node.

  Like HDFS Federation, HA clusters reuse the <<<nameservice ID>>> to
  identify a single HDFS instance that may in fact consist of multiple HA
  NameNodes. In addition, a new abstraction called <<<NameNode ID>>> is added
  with HA. Each distinct NameNode in the cluster has a different NameNode ID
  to distinguish it. To support a single configuration file for all of the
  NameNodes, the relevant configuration parameters are suffixed with the
  <<nameservice ID>> as well as the <<NameNode ID>>.
** Configuration details

  To configure HA NameNodes, you must add several configuration options to
  your <<hdfs-site.xml>> configuration file.

  The order in which you set these configurations is unimportant, but the
  values you choose for <<dfs.nameservices>> and
  <<dfs.ha.namenodes.[nameservice ID]>> will determine the keys of those that
  follow. Thus, you should decide on these values before setting the rest of
  the configuration options.
  * <<dfs.nameservices>> - the logical name for this new nameservice

    Choose a logical name for this nameservice, for example "mycluster", and
    use this logical name for the value of this config option. The name you
    choose is arbitrary. It will be used both for configuration and as the
    authority component of absolute HDFS paths in the cluster.

    <<Note:>> If you are also using HDFS Federation, this configuration
    setting should also include the list of other nameservices, HA or
    otherwise, as a comma-separated list.

----
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
----
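    In the Federation case mentioned in the note above, the same key simply
    lists every nameservice. As an illustrative sketch (the second
    nameservice "yourcluster" is hypothetical, not part of this example
    cluster):

----
<property>
  <name>dfs.nameservices</name>
  <value>mycluster,yourcluster</value>
</property>
----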
  * <<dfs.ha.namenodes.[nameservice ID]>> - unique identifiers for each NameNode in the nameservice

    Configure with a list of comma-separated NameNode IDs. This will be used
    by DataNodes to determine all the NameNodes in the cluster. For example,
    if you used "mycluster" as the nameservice ID previously, and you wanted
    to use "nn1" and "nn2" as the individual IDs of the NameNodes, you would
    configure this as such:

----
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
----

    <<Note:>> Currently, only a maximum of two NameNodes may be configured
    per nameservice.
  * <<dfs.namenode.rpc-address.[nameservice ID].[name node ID]>> - the fully-qualified RPC address for each NameNode to listen on

    For both of the previously-configured NameNode IDs, set the full address
    and IPC port of the NameNode process. Note that this results in two
    separate configuration options. For example:

----
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>machine1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>machine2.example.com:8020</value>
</property>
----

    <<Note:>> You may similarly configure the "<<servicerpc-address>>"
    setting if you so desire.
  * <<dfs.namenode.http-address.[nameservice ID].[name node ID]>> - the fully-qualified HTTP address for each NameNode to listen on

    Similarly to <rpc-address> above, set the addresses for both NameNodes'
    HTTP servers to listen on. For example:

----
<property>
  <name>dfs.namenode.http-address.mycluster.nn1</name>
  <value>machine1.example.com:50070</value>
</property>
<property>
  <name>dfs.namenode.http-address.mycluster.nn2</name>
  <value>machine2.example.com:50070</value>
</property>
----

    <<Note:>> If you have Hadoop's security features enabled, you should also
    set the <https-address> similarly for each NameNode.
  * <<dfs.namenode.shared.edits.dir>> - the location of the shared storage directory

    This is where one configures the path to the remote shared edits
    directory which the Standby NameNode uses to stay up-to-date with all the
    file system changes the Active NameNode makes. <<You should only
    configure one of these directories.>> This directory should be mounted
    r/w on both NameNode machines. The value of this setting should be the
    absolute path to this directory on the NameNode machines. For example:

----
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>file:///mnt/filer1/dfs/ha-name-dir-shared</value>
</property>
----
  * <<dfs.client.failover.proxy.provider.[nameservice ID]>> - the Java class that HDFS clients use to contact the Active NameNode

    Configure the name of the Java class which will be used by the DFS Client
    to determine which NameNode is the current Active, and therefore which
    NameNode is currently serving client requests. The only implementation
    which currently ships with Hadoop is the
    <<ConfiguredFailoverProxyProvider>>, so use this unless you are using a
    custom one. For example:

----
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
----
  * <<dfs.ha.fencing.methods>> - a list of scripts or Java classes which will be used to fence the Active NameNode during a failover

    It is critical for correctness of the system that only one NameNode be in
    the Active state at any given time. Thus, during a failover, we first
    ensure that the Active NameNode is either in the Standby state, or that
    its process has terminated, before transitioning the other NameNode to
    the Active state. In order to do this, you must configure at least one
    <<fencing method.>> These are configured as a carriage-return-separated
    list, which will be attempted in order until one indicates that fencing
    has succeeded. There are two methods which ship with Hadoop: <shell> and
    <sshfence>. For information on implementing your own custom fencing
    method, see the <org.apache.hadoop.ha.NodeFencer> class.
  * <<sshfence>> - SSH to the Active NameNode and kill the process

    The <sshfence> option SSHes to the target node and uses <fuser> to kill
    the process listening on the service's TCP port. In order for this
    fencing option to work, it must be able to SSH to the target node without
    providing a passphrase. Thus, one must also configure the
    <<dfs.ha.fencing.ssh.private-key-files>> option, which is a
    comma-separated list of SSH private key files. For example:

---
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/exampleuser/.ssh/id_rsa</value>
</property>
---

    Optionally, one may configure a non-standard username or port to perform
    the SSH. One may also configure a timeout, in milliseconds, for the SSH,
    after which this fencing method will be considered to have failed. It may
    be configured like so:

---
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence([[username][:port]])</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.connect-timeout</name>
  <value>30000</value>
</property>
---
  * <<shell>> - run an arbitrary shell command to fence the Active NameNode

    The <shell> fencing method runs an arbitrary shell command. It may be
    configured like so:

---
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/path/to/my/script.sh arg1 arg2 ...)</value>
</property>
---

    The string between '(' and ')' is passed directly to a bash shell and may
    not include any closing parentheses.

    The shell command will be run with an environment set up to contain all
    of the current Hadoop configuration variables, with the '_' character
    replacing any '.' characters in the configuration keys. The configuration
    used has already had any namenode-specific configurations promoted to
    their generic forms -- for example <<dfs_namenode_rpc-address>> will
    contain the RPC address of the target node, even though the configuration
    may specify that variable as <<dfs.namenode.rpc-address.ns1.nn1>>.

    Additionally, the following variables referring to the target node to be
    fenced are also available:

*-----------------------:-------------------------------------------+
| $target_host          | hostname of the node to be fenced         |
*-----------------------:-------------------------------------------+
| $target_port          | IPC port of the node to be fenced         |
*-----------------------:-------------------------------------------+
| $target_address       | the above two, combined as host:port      |
*-----------------------:-------------------------------------------+
| $target_nameserviceid | the nameservice ID of the NN to be fenced |
*-----------------------:-------------------------------------------+
| $target_namenodeid    | the namenode ID of the NN to be fenced    |
*-----------------------:-------------------------------------------+

    These environment variables may also be used as substitutions in the
    shell command itself. For example:

---
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/path/to/my/script.sh --nameservice=$target_nameserviceid $target_host:$target_port)</value>
</property>
---

    If the shell command returns an exit code of 0, the fencing is determined
    to be successful. If it returns any other exit code, the fencing was not
    successful and the next fencing method in the list will be attempted.

    <<Note:>> This fencing method does not implement any timeout. If
    timeouts are necessary, they should be implemented in the shell script
    itself (e.g. by forking a subshell to kill its parent in some number of
    seconds).
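    The subshell idiom just mentioned can be sketched as a small wrapper.
    This is an illustrative sketch only -- the wrapper and the
    <<<fence_with_timeout>>> name are not part of Hadoop, and the real
    fencing command is whatever you would otherwise have configured:

```shell
#!/usr/bin/env bash
# fence_with_timeout TIMEOUT_SECS CMD [ARGS...]
# Run the real fencing command, but kill this script (and thereby fail the
# fencing attempt) if the command has not finished after TIMEOUT_SECS.
fence_with_timeout() {
  local timeout=$1; shift
  # Watchdog subshell: sleeps, then kills the parent if it is still running.
  ( sleep "$timeout" && kill -9 $$ ) >/dev/null 2>&1 &
  local watchdog=$!
  "$@"                            # the actual fencing work
  local status=$?
  kill "$watchdog" 2>/dev/null    # finished in time; cancel the watchdog
  return "$status"
}
```

    Such a wrapper script could then be configured as, for example,
    <<<shell(/path/to/fence-wrapper.sh 30 /path/to/real-fence.sh $target_host)>>>
    (both paths hypothetical).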
  * <<fs.defaultFS>> - the default path prefix used by the Hadoop FS client when none is given

    Optionally, you may now configure the default path for Hadoop clients to
    use the new HA-enabled logical URI. If you used "mycluster" as the
    nameservice ID earlier, this will be the value of the authority portion
    of all of your HDFS paths. This may be configured like so, in your
    <<core-site.xml>> file:

---
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://mycluster</value>
</property>
---
** Deployment details

  After all of the necessary configuration options have been set, one must
  initially synchronize the two HA NameNodes' on-disk metadata.

  * If you are setting up a fresh HDFS cluster, you should first run the
    format command (<hdfs namenode -format>) on one of the NameNodes.

  * If you have already formatted the NameNode, or are converting a
    non-HA-enabled cluster to be HA-enabled, you should now copy over the
    contents of your NameNode metadata directories to the other, unformatted
    NameNode by running the command "<hdfs namenode -bootstrapStandby>" on
    the unformatted NameNode. Running this command will also ensure that the
    shared edits directory (as configured by
    <<dfs.namenode.shared.edits.dir>>) contains sufficient edits transactions
    to be able to start both NameNodes.

  * If you are converting a non-HA NameNode to be HA, you should run the
    command "<hdfs namenode -initializeSharedEdits>", which will initialize
    the shared edits directory with the edits data from the local NameNode
    edits directories.

  At this point you may start both of your HA NameNodes as you normally
  would start a NameNode.

  You can visit each of the NameNodes' web pages separately by browsing to
  their configured HTTP addresses. You should notice that next to the
  configured address will be the HA state of the NameNode (either "standby"
  or "active".) Whenever an HA NameNode starts, it is initially in the
  Standby state.
** Administrative commands

  Now that your HA NameNodes are configured and started, you will have access
  to some additional commands to administer your HA HDFS cluster.
  Specifically, you should familiarize yourself with all of the subcommands
  of the "<hdfs haadmin>" command. Running this command without any
  additional arguments will display the following usage information:

---
Usage: DFSHAAdmin [-ns <nameserviceId>]
    [-transitionToActive <serviceId>]
    [-transitionToStandby <serviceId>]
    [-failover [--forcefence] [--forceactive] <serviceId> <serviceId>]
    [-getServiceState <serviceId>]
    [-checkHealth <serviceId>]
    [-help <command>]
---

  This guide describes high-level uses of each of these subcommands. For
  specific usage information of each subcommand, you should run "<hdfs
  haadmin -help <command>>".
  * <<transitionToActive>> and <<transitionToStandby>> - transition the state of the given NameNode to Active or Standby

    These subcommands cause a given NameNode to transition to the Active or
    Standby state, respectively. <<These commands do not attempt to perform
    any fencing, and thus should rarely be used.>> Instead, one should almost
    always prefer to use the "<hdfs haadmin -failover>" subcommand.

  * <<failover>> - initiate a failover between two NameNodes

    This subcommand causes a failover from the first provided NameNode to the
    second. If the first NameNode is in the Standby state, this command
    simply transitions the second to the Active state without error. If the
    first NameNode is in the Active state, an attempt will be made to
    gracefully transition it to the Standby state. If this fails, the fencing
    methods (as configured by <<dfs.ha.fencing.methods>>) will be attempted
    in order until one succeeds. Only after this process will the second
    NameNode be transitioned to the Active state. If no fencing method
    succeeds, the second NameNode will not be transitioned to the Active
    state, and an error will be returned.

  * <<getServiceState>> - determine whether the given NameNode is Active or Standby

    Connect to the provided NameNode to determine its current state, printing
    either "standby" or "active" to STDOUT appropriately. This subcommand
    might be used by cron jobs or monitoring scripts which need to behave
    differently based on whether the NameNode is currently Active or Standby.
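    As an illustrative sketch of such a monitoring check (the
    <<<check_active>>> helper is hypothetical, not a Hadoop command; it
    assumes <<<hdfs>>> is on the PATH):

```shell
#!/usr/bin/env bash
# check_active NAMENODE_ID
# Hypothetical monitoring helper: succeeds (exit 0) only if the given
# NameNode reports "active" via `hdfs haadmin -getServiceState`.
check_active() {
  local state
  state=$(hdfs haadmin -getServiceState "$1" 2>/dev/null)
  [ "$state" = "active" ]
}
```

    A cron job might then run <<<check_active nn1 || alert-operator>>>, where
    <<<alert-operator>>> stands in for whatever notification mechanism you
    use.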
  * <<checkHealth>> - check the health of the given NameNode

    Connect to the provided NameNode to check its health. The NameNode is
    capable of performing some diagnostics on itself, including checking if
    internal services are running as expected. This command will return 0 if
    the NameNode is healthy, non-zero otherwise. One might use this command
    for monitoring purposes.

    <<Note:>> This is not yet implemented, and at present will always return
    success, unless the given NameNode is completely down.
* {Automatic Failover}

** Introduction

  The above sections describe how to configure manual failover. In that mode,
  the system will not automatically trigger a failover from the active to the
  standby NameNode, even if the active node has failed. This section
  describes how to configure and deploy automatic failover.
** Components

  Automatic failover adds two new components to an HDFS deployment: a
  ZooKeeper quorum, and the ZKFailoverController process (abbreviated as
  ZKFC).

  Apache ZooKeeper is a highly available service for maintaining small
  amounts of coordination data, notifying clients of changes in that data,
  and monitoring clients for failures. The implementation of automatic HDFS
  failover relies on ZooKeeper for the following things:

  * <<Failure detection>> - each of the NameNode machines in the cluster
    maintains a persistent session in ZooKeeper. If the machine crashes, the
    ZooKeeper session will expire, notifying the other NameNode that a
    failover should be triggered.

  * <<Active NameNode election>> - ZooKeeper provides a simple mechanism to
    exclusively elect a node as active. If the current active NameNode
    crashes, another node may take a special exclusive lock in ZooKeeper
    indicating that it should become the next active.

  The ZKFailoverController (ZKFC) is a new component: a ZooKeeper client
  which also monitors and manages the state of the NameNode. Each of the
  machines which runs a NameNode also runs a ZKFC, and that ZKFC is
  responsible for:

  * <<Health monitoring>> - the ZKFC pings its local NameNode on a periodic
    basis with a health-check command. So long as the NameNode responds in a
    timely fashion with a healthy status, the ZKFC considers the node
    healthy. If the node has crashed, frozen, or otherwise entered an
    unhealthy state, the health monitor will mark it as unhealthy.

  * <<ZooKeeper session management>> - when the local NameNode is healthy,
    the ZKFC holds a session open in ZooKeeper. If the local NameNode is
    active, it also holds a special "lock" znode. This lock uses ZooKeeper's
    support for "ephemeral" nodes; if the session expires, the lock node will
    be automatically deleted.

  * <<ZooKeeper-based election>> - if the local NameNode is healthy, and the
    ZKFC sees that no other node currently holds the lock znode, it will
    itself try to acquire the lock. If it succeeds, then it has "won the
    election", and is responsible for running a failover to make its local
    NameNode active. The failover process is similar to the manual failover
    described above: first, the previous active is fenced if necessary, and
    then the local NameNode transitions to the Active state.

  For more details on the design of automatic failover, refer to the design
  document attached to HDFS-2185 on the Apache HDFS JIRA.
** Deploying ZooKeeper

  In a typical deployment, ZooKeeper daemons are configured to run on three
  or five nodes. Since ZooKeeper itself has light resource requirements, it
  is acceptable to collocate the ZooKeeper nodes on the same hardware as the
  HDFS NameNode and Standby Node. Many operators choose to deploy the third
  ZooKeeper process on the same node as the YARN ResourceManager. It is
  advisable to configure the ZooKeeper nodes to store their data on separate
  disk drives from the HDFS metadata for best performance and isolation.

  The setup of ZooKeeper is out of scope for this document. We will assume
  that you have set up a ZooKeeper cluster running on three or more nodes,
  and have verified its correct operation by connecting using the ZK CLI.

** Before you begin

  Before you begin configuring automatic failover, you should shut down your
  cluster. It is not currently possible to transition from a manual failover
  setup to an automatic failover setup while the cluster is running.
** Configuring automatic failover

  The configuration of automatic failover requires the addition of two new
  parameters to your configuration. In your <<<hdfs-site.xml>>> file, add:

----
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
----

  This specifies that the cluster should be set up for automatic failover.

  In your <<<core-site.xml>>> file, add:

----
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
----

  This lists the host-port pairs running the ZooKeeper service.

  As with the parameters described earlier in the document, these settings
  may be configured on a per-nameservice basis by suffixing the configuration
  key with the nameservice ID. For example, in a cluster with federation
  enabled, you can explicitly enable automatic failover for only one of the
  nameservices by setting
  <<<dfs.ha.automatic-failover.enabled.my-nameservice-id>>>.
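  As a sketch, such a per-nameservice override (using the
  "my-nameservice-id" placeholder from above) would look like:

----
<property>
  <name>dfs.ha.automatic-failover.enabled.my-nameservice-id</name>
  <value>true</value>
</property>
----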
  There are also several other configuration parameters which may be set to
  control the behavior of automatic failover; however, they are not necessary
  for most installations. Please refer to the configuration key specific
  documentation for details.
** Initializing HA state in ZooKeeper

  After the configuration keys have been added, the next step is to
  initialize the required state in ZooKeeper. You can do so by running the
  following command from one of the NameNode hosts.

----
[hdfs]$ $HADOOP_PREFIX/bin/hdfs zkfc -formatZK
----

  This will create a znode in ZooKeeper inside of which the automatic
  failover system stores its data.
** Starting the cluster with <<<start-dfs.sh>>>

  Since automatic failover has been enabled in the configuration, the
  <<<start-dfs.sh>>> script will now automatically start a ZKFC daemon on any
  machine that runs a NameNode. When the ZKFCs start, they will automatically
  select one of the NameNodes to become active.

** Starting the cluster manually

  If you manually manage the services on your cluster, you will need to
  manually start the <<<zkfc>>> daemon on each of the machines that runs a
  NameNode. You can start the daemon by running:

----
[hdfs]$ $HADOOP_PREFIX/bin/hdfs --daemon start zkfc
----
** Securing access to ZooKeeper

  If you are running a secure cluster, you will likely want to ensure that the
  information stored in ZooKeeper is also secured. This prevents malicious
  clients from modifying the metadata in ZooKeeper or potentially triggering a
  false failover.

  In order to secure the information in ZooKeeper, first add the following to
  your <<<core-site.xml>>> file:

----
<property>
  <name>ha.zookeeper.auth</name>
  <value>@/path/to/zk-auth.txt</value>
</property>
<property>
  <name>ha.zookeeper.acl</name>
  <value>@/path/to/zk-acl.txt</value>
</property>
----

  Please note the '@' character in these values -- this specifies that the
  configurations are not inline, but rather point to a file on disk.
  The first configured file specifies a list of ZooKeeper authentications, in
  the same format as used by the ZK CLI. For example, you may specify something
  like:

----
digest:hdfs-zkfcs:mypassword
----

  ...where <<<hdfs-zkfcs>>> is a unique username for ZooKeeper, and
  <<<mypassword>>> is some unique string used as a password.

  Next, generate a ZooKeeper ACL that corresponds to this authentication, using
  a command like the following:

----
[hdfs]$ java -cp $ZK_HOME/lib/*:$ZK_HOME/zookeeper-3.4.2.jar org.apache.zookeeper.server.auth.DigestAuthenticationProvider hdfs-zkfcs:mypassword
output: hdfs-zkfcs:mypassword->hdfs-zkfcs:P/OQvnYyU/nF/mGYvB/xurX8dYs=
----
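
  If the ZooKeeper jars are not handy, the same credential can be reproduced
  with a few lines of Python, since the digest is simply the Base64-encoded
  SHA-1 of the <<<username:password>>> string (mirroring ZooKeeper's
  <<<DigestAuthenticationProvider.generateDigest()>>>). This is a sketch, not
  part of the Hadoop or ZooKeeper tooling:

```python
import base64
import hashlib

def zk_digest(username: str, password: str) -> str:
    """Compute a ZooKeeper digest-ACL credential: the Base64-encoded
    SHA-1 of "username:password", prefixed with the username."""
    raw = f"{username}:{password}".encode("utf-8")
    b64 = base64.b64encode(hashlib.sha1(raw).digest()).decode("ascii")
    return f"{username}:{b64}"

print(zk_digest("hdfs-zkfcs", "mypassword"))
```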
  Copy and paste the section of this output after the '->' string into the
  file <<<zk-acl.txt>>>, prefixed by the string "<<<digest:>>>". For example:

----
digest:hdfs-zkfcs:P/OQvnYyU/nF/mGYvB/xurX8dYs=:rwcda
----

  In order for these ACLs to take effect, you should then rerun the
  <<<zkfc -formatZK>>> command as described above.

  After doing so, you may verify the ACLs from the ZK CLI as follows:

----
[zk: localhost:2181(CONNECTED) 1] getAcl /hadoop-ha
'digest,'hdfs-zkfcs:P/OQvnYyU/nF/mGYvB/xurX8dYs=
: cdrwa
----
** Verifying automatic failover

  Once automatic failover has been set up, you should test its operation. To do
  so, first locate the active NameNode. You can tell which node is active by
  visiting the NameNode web interfaces -- each node reports its HA state at the
  top of the page.

  Once you have located your active NameNode, you may cause a failure on that
  node. For example, you can use <<<kill -9 <pid of NN>>>> to simulate a JVM
  crash. Or, you could power-cycle the machine or unplug its network interface
  to simulate a different kind of outage. After triggering the outage you wish
  to test, the other NameNode should automatically become active within several
  seconds. The amount of time required to detect a failure and trigger a
  failover depends on the configuration of
  <<<ha.zookeeper.session-timeout.ms>>>, but defaults to 5 seconds.
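
  The failure-detection window can be tuned in <<<core-site.xml>>>. For
  example, to shrink it (the value of 3000 ms below is purely illustrative, and
  shorter timeouts make failovers more sensitive to transient ZooKeeper
  hiccups):

----
<property>
  <name>ha.zookeeper.session-timeout.ms</name>
  <value>3000</value>
</property>
----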
  If the test does not succeed, you may have a misconfiguration. Check the
  logs for the <<<zkfc>>> daemons as well as the NameNode daemons in order to
  further diagnose the issue.
* Automatic Failover FAQ

  * <<Is it important that I start the ZKFC and NameNode daemons in any
  particular order?>>

  No. On any given node you may start the ZKFC before or after its
  corresponding NameNode.

  * <<What additional monitoring should I put in place?>>

  You should add monitoring on each host that runs a NameNode to ensure that
  the ZKFC remains running. In some types of ZooKeeper failures, for example,
  the ZKFC may unexpectedly exit, and should be restarted to ensure that the
  system is ready for automatic failover.

  Additionally, you should monitor each of the servers in the ZooKeeper
  quorum. If ZooKeeper crashes, then automatic failover will not function.

  * <<What happens if ZooKeeper goes down?>>

  If the ZooKeeper cluster crashes, no automatic failovers will be triggered.
  However, HDFS will continue to run without any impact. When ZooKeeper is
  restarted, HDFS will reconnect with no issues.

  * <<Can I designate one of my NameNodes as primary/preferred?>>

  No. Currently, this is not supported. Whichever NameNode is started first
  will become active. You may choose to start the cluster in a specific order
  such that your preferred node starts first.

  * <<How can I initiate a manual failover when automatic failover is
  configured?>>

  Even if automatic failover is configured, you may initiate a manual failover
  using the same <<<hdfs haadmin>>> command. It will perform a coordinated
  failover.
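
  For example, assuming two NameNodes configured with the (illustrative) IDs
  <<<nn1>>> and <<<nn2>>>, a coordinated failover from the first to the second
  would look like:

----
[hdfs]$ hdfs haadmin -failover nn1 nn2
----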
* BookKeeper as a Shared storage (EXPERIMENTAL)

  One option for shared storage for the NameNode is BookKeeper. BookKeeper
  achieves high availability and strong durability guarantees by replicating
  edit log entries across multiple storage nodes. The edit log can be striped
  across the storage nodes for high performance. Fencing is supported in the
  protocol, i.e., BookKeeper will not allow two writers to write to a single
  edit log.

  The metadata for BookKeeper is stored in ZooKeeper. In the current HA
  architecture, a ZooKeeper cluster is required for the ZKFC; the same cluster
  can be used for BookKeeper metadata.

  For more details on building a BookKeeper cluster, please refer to the
  {{{http://zookeeper.apache.org/bookkeeper/docs/trunk/bookkeeperConfig.html }BookKeeper documentation}}.

  The BookKeeperJournalManager is an implementation of the HDFS JournalManager
  interface, which allows custom write-ahead logging implementations to be
  plugged into the HDFS NameNode.
** <<BookKeeper Journal Manager>>

  To use BookKeeperJournalManager, add the following to <<<hdfs-site.xml>>>:

----
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>bookkeeper://zk1:2181;zk2:2181;zk3:2181/hdfsjournal</value>
</property>
<property>
  <name>dfs.namenode.edits.journal-plugin.bookkeeper</name>
  <value>org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager</value>
</property>
----

  The URI format for BookKeeper is <<<bookkeeper://[zkEnsemble]/[rootZnode]>>>,
  where <<<[zkEnsemble]>>> is a semicolon-separated list of ZooKeeper host:port
  pairs. In the example above there are three servers in the ensemble, zk1,
  zk2 and zk3, each listening on port 2181. <<<[rootZnode]>>> is the path of
  the ZooKeeper znode under which the edit log information will be stored.

  The class specified for the journal-plugin must be available in the
  NameNode's classpath. We explain below how to generate a jar file with the
  journal manager and its dependencies, and how to put it into the classpath.
*** <<More configuration options>>

  * <<dfs.namenode.bookkeeperjournal.output-buffer-size>> -
  Number of bytes a BookKeeper journal stream will buffer before
  forcing a flush. Default is 1024.

----
<property>
  <name>dfs.namenode.bookkeeperjournal.output-buffer-size</name>
  <value>1024</value>
</property>
----

  * <<dfs.namenode.bookkeeperjournal.ensemble-size>> -
  Number of BookKeeper servers in edit log ensembles. This
  is the number of BookKeeper servers which need to be available
  for the edit log to be writable. Default is 3.

----
<property>
  <name>dfs.namenode.bookkeeperjournal.ensemble-size</name>
  <value>3</value>
</property>
----

  * <<dfs.namenode.bookkeeperjournal.quorum-size>> -
  Number of BookKeeper servers in the write quorum. This is the
  number of BookKeeper servers which must have acknowledged the
  write of an entry before it is considered written. Default is 2.

----
<property>
  <name>dfs.namenode.bookkeeperjournal.quorum-size</name>
  <value>2</value>
</property>
----

  * <<dfs.namenode.bookkeeperjournal.digestPw>> -
  Password to use when creating edit log segments.

----
<property>
  <name>dfs.namenode.bookkeeperjournal.digestPw</name>
  <value>myPassword</value>
</property>
----

  * <<dfs.namenode.bookkeeperjournal.zk.session.timeout>> -
  Session timeout for the ZooKeeper client used by the BookKeeper Journal
  Manager. It is recommended that this value be less than the ZKFC session
  timeout. Default is 3000.

----
<property>
  <name>dfs.namenode.bookkeeperjournal.zk.session.timeout</name>
  <value>3000</value>
</property>
----
*** <<Building BookKeeper Journal Manager plugin jar>>

  To generate the distribution packages for the BK journal, run the following:

----
$ mvn clean package -Pdist
----

  This will generate a jar with the BookKeeperJournalManager at
  hadoop-hdfs/src/contrib/bkjournal/target/hadoop-hdfs-bkjournal-<VERSION>.jar

  Note that the -Pdist part of the build command is important, as it copies
  the dependent bookkeeper-server jar under
  hadoop-hdfs/src/contrib/bkjournal/target/lib.
*** <<Putting the BookKeeperJournalManager in the NameNode classpath>>

  To run an HDFS NameNode using BookKeeper as a backend, copy the bkjournal
  and bookkeeper-server jars, mentioned above, into the lib directory of HDFS.
  In the standard distribution of HDFS, this is at
  $HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/:

----
cp hadoop-hdfs/src/contrib/bkjournal/target/hadoop-hdfs-bkjournal-<VERSION>.jar $HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/
----
*** <<Current limitations>>

  1) Security in BookKeeper. BookKeeper does not support SASL or SSL for
  connections between the NameNode and BookKeeper storage nodes.