<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->

HDFS Upgrade Domain
====================

<!-- MACRO{toc|fromDepth=0|toDepth=3} -->

Introduction
------------
+
|
|
|
|
+The current default HDFS block placement policy guarantees that a block’s 3 replicas will be placed
|
|
|
|
+on at least 2 racks. Specifically one replica is placed on one rack and the other two replicas
|
|
|
|
+are placed on another rack during write pipeline. This is a good compromise between rack diversity and write-pipeline efficiency. Note that
|
|
|
|
+subsequent load balancing or machine membership change might cause 3 replicas of a block to be distributed
|
|
|
|
+across 3 different racks. Thus any 3 datanodes in different racks could store 3 replicas of a block.
|
|
|
|
+
|
|
|
|
+
|
|
|
|
+However, the default placement policy impacts how we should perform datanode rolling upgrade.
|
|
|
|
+[HDFS Rolling Upgrade document](./HdfsRollingUpgrade.html) explains how the datanodes can be upgraded in a rolling
|
|
|
|
+fashion without downtime. Because any 3 datanodes in different racks could store all the replicas of a block, it is
|
|
|
|
+important to perform sequential restart of datanodes one at a time in order to minimize the impact on data availability
|
|
|
|
+and read/write operations. Upgrading one rack at a time is another option; but that will increase the chance of
|
|
|
|
+data unavailability if there is machine failure at another rack during the upgrade.
|
|
|
|
+
|
|
|
|
+The side effect of this sequential datanode rolling upgrade strategy is longer
|
|
|
|
+upgrade duration for larger clusters.
|
|
|
|
+
|
|
|
|
+
|
|
|
|
+Architecture
|
|
|
|
+-------
|
|
|
|
+
|
|
|
|
+To address the limitation of block placement policy on rolling upgrade, the concept of upgrade domain
|
|
|
|
+has been added to HDFS via a new block placement policy. The idea is to group datanodes in a new
|
|
|
|
+dimension called upgrade domain, in addition to the existing rack-based grouping.
|
|
|
|
+For example, we can assign all datanodes in the first position of any rack to upgrade domain ud_01,
|
|
|
|
+nodes in the second position to upgrade domain ud_02 and so on.
|
|
|
|
+
|
|
|
|
+The namenode provides BlockPlacementPolicy interface to support any custom block placement besides
|
|
|
|
+the default block placement policy. A new upgrade domain block placement policy based on this interface
|
|
|
|
+is available in HDFS. It will make sure replicas of any given block are distributed across machines from different upgrade domains.
|
|
|
|
+By default, 3 replicas of any given block are placed on 3 different upgrade domains. This means all datanodes belonging to
|
|
|
|
+a specific upgrade domain collectively won't store more than one replica of any block.
|
|
|
|
+
|
|
|
|
+With upgrade domain block placement policy in place, we can upgrade all datanodes belonging to one upgrade domain at the
|
|
|
|
+same time without impacting data availability. Only after finishing upgrading one upgrade domain we move to the next
|
|
|
|
+upgrade domain until all upgrade domains have been upgraded. Such procedure will ensure no two replicas of any given
|
|
|
|
+block will be upgraded at the same time. This means we can upgrade many machines at the same time for a large cluster.
|
|
|
|
+And as the cluster continues to scale, new machines will be added to the existing upgrade domains without impact the
|
|
|
|
+parallelism of the upgrade.
|
|
|
|
+
|
|
|
|
+For an existing cluster with the default block placement policy, after switching to the new upgrade domain block
|
|
|
|
+placement policy, any newly created blocks will conform the new policy. The old blocks allocated based on the old policy
|
|
|
|
+need to migrated the new policy. There is a migrator tool you can use. See HDFS-8789 for details.
|
|
|
|
+
|
|
|
|
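
The core invariant can be sketched in Python. This is only an illustration of the placement
rule, not the actual `BlockPlacementPolicyWithUpgradeDomain` implementation, which also weighs
rack diversity, node load, and other constraints: choose replica targets so that no two targets
share an upgrade domain.

```python
def choose_targets(nodes, replication=3):
    """Pick `replication` datanodes such that no two share an upgrade domain.

    `nodes` is a list of (hostname, rack, upgrade_domain) tuples. Simplified
    sketch only: the real policy also considers racks, load, and exclusions.
    """
    chosen, used_domains = [], set()
    for host, rack, domain in nodes:
        if domain not in used_domains:
            chosen.append(host)
            used_domains.add(domain)
        if len(chosen) == replication:
            return chosen
    raise ValueError("not enough distinct upgrade domains for %d replicas" % replication)

nodes = [
    ("dcArackA01", "/rackA", "01"),
    ("dcArackA02", "/rackA", "02"),
    ("dcArackB01", "/rackB", "01"),
    ("dcArackB02", "/rackB", "02"),
    ("dcArackC03", "/rackC", "03"),
]
print(choose_targets(nodes))  # three hosts, all in distinct upgrade domains
```

Restarting every node of one upgrade domain at once is safe precisely because this selection
never places two replicas of a block in the same domain.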
+
|
|
|
|
+Settings
|
|
|
|
+-------
|
|
|
|
+
|
|
|
|
+To enable upgrade domain on your clusters, please follow these steps:
|
|
|
|
+
|
|
|
|
+* Assign datanodes to individual upgrade domain groups.
|
|
|
|
+* Enable upgrade domain block placement policy.
|
|
|
|
+* Migrate blocks allocated based on old block placement policy to the new upgrade domain policy.
|
|
|
|
+
|
|
|
|
+### Upgrade domain id assignment
|
|
|
|
+
|
|
|
|
+How a datanode maps to an upgrade domain id is defined by administrators and specific to the cluster layout.
|
|
|
|
+A common way to use the rack position of the machine as its upgrade domain id.
|
|
|
|
+
|
|
|
|
+To configure mapping from host name to its upgrade domain id, we need to use json-based host configuration file.
|
|
|
|
+by setting the following property as explained in [hdfs-default.xml](./hdfs-default.xml).
|
|
|
|
+
|
|
|
|
| Setting | Value |
|:---- |:---- |
|`dfs.namenode.hosts.provider.classname` | `org.apache.hadoop.hdfs.server.blockmanagement.CombinedHostFileManager`|
|`dfs.hosts`| the path of the JSON hosts file |

The JSON hosts file defines the properties for all hosts. In the following example, there are 4
datanodes in 2 racks: the machines at rack position 01 belong to upgrade domain 01, and the
machines at rack position 02 belong to upgrade domain 02.

```json
[
  {
    "hostName": "dcArackA01",
    "upgradeDomain": "01"
  },
  {
    "hostName": "dcArackA02",
    "upgradeDomain": "02"
  },
  {
    "hostName": "dcArackB01",
    "upgradeDomain": "01"
  },
  {
    "hostName": "dcArackB02",
    "upgradeDomain": "02"
  }
]
```
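
Maintaining this file by hand is error-prone on large clusters. As a sketch, the file could be
generated from the node list; the hostname convention assumed here, where the last two digits of
the name encode the rack position, is an example of a site-specific convention, not an HDFS
requirement.

```python
import json

def build_hosts_file(hostnames):
    """Derive each host's upgradeDomain from the last two digits of its name.

    Assumes a naming convention in which the rack position is encoded in the
    final two characters of the hostname (e.g. dcArackB02 -> position 02).
    """
    entries = [{"hostName": h, "upgradeDomain": h[-2:]} for h in hostnames]
    return json.dumps(entries, indent=2)

hosts = ["dcArackA01", "dcArackA02", "dcArackB01", "dcArackB02"]
print(build_hosts_file(hosts))
```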

### Enable upgrade domain block placement policy

After each datanode has been assigned an upgrade domain id, the next step is to enable the
upgrade domain block placement policy with the following configuration as explained in
[hdfs-default.xml](./hdfs-default.xml).

| Setting | Value |
|:---- |:---- |
|`dfs.block.replicator.classname`| `org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithUpgradeDomain` |

After the namenode is restarted, the new policy will be used for any new block allocation.
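
Putting the settings together, the relevant `hdfs-site.xml` entries might look like the
following; the path to the JSON hosts file is an example, not a required location.

```xml
<property>
  <name>dfs.namenode.hosts.provider.classname</name>
  <value>org.apache.hadoop.hdfs.server.blockmanagement.CombinedHostFileManager</value>
</property>
<property>
  <name>dfs.hosts</name>
  <!-- Example path; point this at your JSON hosts file. -->
  <value>/etc/hadoop/conf/dfs.hosts.json</value>
</property>
<property>
  <name>dfs.block.replicator.classname</name>
  <value>org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithUpgradeDomain</value>
</property>
```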
+
|
|
|
|
+
|
|
|
|
+### Migration
|
|
|
|
+
|
|
|
|
+If you change the block placement policy of an existing cluster, you will need to make sure the
|
|
|
|
+blocks allocated prior to the block placement policy change conform the new block placement policy.
|
|
|
|
+
|
|
|
|
+HDFS-8789 provides the initial draft patch of a client-side migration tool. After the tool is committed,
|
|
|
|
+we will be able to describe how to use the tool.
|
|
|
|
+
|
|
|
|
+
|
|
|
|
+Rolling restart based on upgrade domains
|
|
|
|
+-------
|
|
|
|
+
|
|
|
|
+During cluster administration, we might need to restart datanodes to pick up new configuration, new hadoop release
|
|
|
|
+or JVM version and so on. With upgrade domains enabled and all blocks on the cluster conform to the new policy, we can now
|
|
|
|
+restart datanodes in batches, one upgrade domain at a time. Whether it is manual process or via automation, the steps are
|
|
|
|
+
|
|
|
|
+* Group datanodes by upgrade domains based on dfsadmin or JMX's datanode information.
|
|
|
|
+* For each upgrade domain
|
|
|
|
+ * (Optional) put all the nodes in that upgrade domain to maintenance state (refer to [HdfsDataNodeAdminGuide.html](./HdfsDataNodeAdminGuide.html)).
|
|
|
|
+ * Restart all those nodes.
|
|
|
|
+ * Check if all datanodes are healthy after restart. Unhealthy nodes should be decommissioned.
|
|
|
|
+ * (Optional) Take all those nodes out of maintenance state.
|
|
|
|
+
|
|
|
|
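
The grouping step above can be sketched as follows. This is a planning helper only; the actual
restart and the optional maintenance-state transitions are left to your automation, and the
hostname-to-domain mapping is whatever dfsadmin or JMX reports for your cluster.

```python
from collections import defaultdict

def plan_rolling_restart(datanodes):
    """Group datanodes by upgrade domain and yield one restart batch per domain.

    `datanodes` maps hostname -> upgrade domain id. Because every replica of a
    block lives in a different upgrade domain, each yielded batch can be
    restarted in parallel without losing data availability.
    """
    batches = defaultdict(list)
    for host, domain in datanodes.items():
        batches[domain].append(host)
    for domain in sorted(batches):
        yield domain, sorted(batches[domain])

datanodes = {
    "dcArackA01": "01", "dcArackA02": "02",
    "dcArackB01": "01", "dcArackB02": "02",
}
for domain, hosts in plan_rolling_restart(datanodes):
    print(domain, hosts)
```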
+
|
|
|
|
+Metrics
|
|
|
|
+-----------
|
|
|
|
+
|
|
|
|
+Upgrade domains are part of namenode's JMX. As explained in [HDFSCommands.html](./HDFSCommands.html), you can also verify upgrade domains using the following commands.
|
|
|
|
+
|
|
|
|
+Use `dfsadmin` to check upgrade domains at the cluster level.
|
|
|
|
+
|
|
|
|
+`hdfs dfsadmin -report`
|
|
|
|
+
|
|
|
|
+Use `fsck` to check upgrade domains of datanodes storing data at a specific path.
|
|
|
|
+
|
|
|
|
+`hdfs fsck <path> -files -blocks -upgradedomains`
|