
<!---
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
Apache Hadoop ${project.version}
================================

Apache Hadoop ${project.version} is an update to the Hadoop 3.4.x release branch.

Overview of Changes
===================

Users are encouraged to read the full set of release notes.
This page provides an overview of the major changes.
S3A: Upgrade AWS SDK to V2
----------------------------------------

[HADOOP-18073](https://issues.apache.org/jira/browse/HADOOP-18073) S3A: Upgrade AWS SDK to V2

This release upgrades Hadoop's AWS connector, S3A, from AWS SDK for Java V1 to AWS SDK for Java V2.
This is a significant change which offers a number of new features, including the ability to work with Amazon S3 Express One Zone storage - the new high-performance, single-AZ storage class.
HDFS DataNode Split one FsDatasetImpl lock to volume grain locks
----------------------------------------

[HDFS-15382](https://issues.apache.org/jira/browse/HDFS-15382) Split one FsDatasetImpl lock to volume grain locks.

Throughput is one of the core performance metrics for a DataNode instance.
However, despite various improvements, DataNodes have not always reached their best throughput, especially in Federation deployments, because of a global coarse-grained lock.
This series of issues (including [HDFS-16534](https://issues.apache.org/jira/browse/HDFS-16534), [HDFS-16511](https://issues.apache.org/jira/browse/HDFS-16511), [HDFS-15382](https://issues.apache.org/jira/browse/HDFS-15382) and [HDFS-16429](https://issues.apache.org/jira/browse/HDFS-16429))
splits the global coarse-grained lock into fine-grained, two-level locks for block pools and volumes,
improving throughput and avoiding lock contention between block pools and volumes.
YARN Federation improvements
----------------------------------------

[YARN-5597](https://issues.apache.org/jira/browse/YARN-5597) YARN Federation improvements.

We have enhanced the YARN Federation functionality for improved usability. The enhanced features are as follows:

1. YARN Router now boasts a full implementation of all interfaces, including ApplicationClientProtocol, ResourceManagerAdministrationProtocol, and RMWebServiceProtocol.
2. YARN Router now supports application cleanup and automatic offline mechanisms for subClusters.
3. Code improvements were undertaken for the Router and AMRMProxy, along with enhancements to previously pending functionalities.
4. Audit logs and metrics for the Router received upgrades.
5. A boost in cluster security features was achieved, with the inclusion of Kerberos support.
6. The Router's web UI pages have been enhanced.
7. A set of commands has been added to the Router side for operating on SubClusters and Policies.
YARN Capacity Scheduler improvements
----------------------------------------

[YARN-10496](https://issues.apache.org/jira/browse/YARN-10496) Support Flexible Auto Queue Creation in Capacity Scheduler

Capacity Scheduler's resource distribution was extended with a new allocation mode called weight mode.
Defining queue capacities with weights allows users to use the newly added flexible queue auto-creation mode.
Flexible mode now supports the dynamic creation of both **parent queues** and **leaf queues**, enabling the creation of
complex queue hierarchies at application submission time.
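As an illustrative sketch, weight-mode capacities and flexible auto queue creation might look like this in `capacity-scheduler.xml` (the queue names here are examples, not defaults; consult the Capacity Scheduler documentation for the authoritative property list):

```xml
<configuration>
  <!-- Child queues of root, with capacities defined as weights (weight mode). -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default,dev</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>1w</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.dev.capacity</name>
    <value>3w</value>
  </property>
  <!-- Allow dynamic creation of parent and leaf queues under root.dev. -->
  <property>
    <name>yarn.scheduler.capacity.root.dev.auto-queue-creation-v2.enabled</name>
    <value>true</value>
  </property>
</configuration>
```

With such a setup, an application submitted to a queue like `root.dev.team-a.user-x` can trigger the on-demand creation of both the intermediate parent queue and the leaf queue.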
[YARN-10888](https://issues.apache.org/jira/browse/YARN-10888) New capacity modes for Capacity Scheduler

Capacity Scheduler's resource distribution was completely refactored to be more flexible and extensible. There is a new concept
called Capacity Vectors, which allows users to mix various resource types in the hierarchy, and also in a single queue. With
this optionally enabled feature it is now possible to define different resources with different units, such as memory in GBs, vcores as
percentage values, and GPUs/FPGAs as weights, all in the same queue.

[YARN-10889](https://issues.apache.org/jira/browse/YARN-10889) Queue Creation in Capacity Scheduler - Various improvements

In addition to the two new features above, there were a number of commits for improvements and bug fixes in Capacity Scheduler.
HDFS RBF: Code Enhancements, New Features, and Bug Fixes
----------------------------------------

The HDFS RBF functionality has undergone significant enhancements, encompassing over 200 commits for feature
improvements, new functionalities, and bug fixes.

Important features and improvements are as follows:

**Feature**

[HDFS-15294](https://issues.apache.org/jira/browse/HDFS-15294) HDFS Federation balance tool introduces a tool to balance data across different namespaces.

[HDFS-13522](https://issues.apache.org/jira/browse/HDFS-13522), [HDFS-16767](https://issues.apache.org/jira/browse/HDFS-16767) Support observer nodes from Router-Based Federation.

**Improvement**

[HADOOP-13144](https://issues.apache.org/jira/browse/HADOOP-13144), [HDFS-13274](https://issues.apache.org/jira/browse/HDFS-13274), [HDFS-15757](https://issues.apache.org/jira/browse/HDFS-15757)

These tickets have enhanced IPC throughput between the Router and the NameNode via multiple connections per user, and optimized connection management.

[HDFS-14090](https://issues.apache.org/jira/browse/HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}

The Router supports assigning a dedicated number of RPC handlers to achieve isolation for each of the downstream nameservices
it is configured to proxy. Since large or busy clusters may generate relatively more RPC traffic to their NameNodes than other clusters,
this feature, when enabled, allows admins to configure a higher number of RPC handlers for busy clusters.
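A sketch of how static handler isolation might be configured in `hdfs-site.xml` on the Router (the nameservice IDs `ns1` and `ns2` and the handler counts are illustrative; see the Router fairness documentation for the authoritative property names):

```xml
<configuration>
  <!-- Enable static per-nameservice RPC handler isolation on the Router. -->
  <property>
    <name>dfs.federation.router.fairness.policy.controller.class</name>
    <value>org.apache.hadoop.hdfs.server.federation.fairness.StaticRouterRpcFairnessPolicyController</value>
  </property>
  <!-- Reserve more handlers for the busy nameservice ns1 than for ns2. -->
  <property>
    <name>dfs.federation.router.fairness.handler.count.ns1</name>
    <value>30</value>
  </property>
  <property>
    <name>dfs.federation.router.fairness.handler.count.ns2</name>
    <value>10</value>
  </property>
</configuration>
```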
[HDFS-17128](https://issues.apache.org/jira/browse/HDFS-17128) RBF: SQLDelegationTokenSecretManager should use version of tokens updated by other routers.

The SQLDelegationTokenSecretManager enhances performance by maintaining processed tokens in memory. However, there was
a potential issue of router cache inconsistency due to token loading and renewal. This issue has been addressed by the
resolution of HDFS-17128.

[HDFS-17148](https://issues.apache.org/jira/browse/HDFS-17148) RBF: SQLDelegationTokenSecretManager must cleanup expired tokens in SQL.

SQLDelegationTokenSecretManager, while fetching and temporarily storing tokens from SQL in a memory cache with a short TTL,
faced an issue where expired tokens were not efficiently cleaned up, leading to a buildup of expired tokens in the SQL database.
This issue has been addressed by the resolution of HDFS-17148.

**Others**

Other changes to HDFS RBF include WebUI, command line, and other improvements. Please refer to the release document.
HDFS EC: Code Enhancements and Bug Fixes
----------------------------------------

HDFS EC has received code improvements and bug fixes.

Important improvements and bugs are as follows:

**Improvement**

[HDFS-16613](https://issues.apache.org/jira/browse/HDFS-16613) EC: Improve performance of decommissioning dn with many ec blocks.

In an HDFS cluster with many EC blocks, decommissioning a DataNode is very slow. Unlike replicated blocks, which can be copied
from any DataNode holding a replica, an EC block has to be replicated from the decommissioning DataNode itself.
The configurations `dfs.namenode.replication.max-streams` and `dfs.namenode.replication.max-streams-hard-limit` limit
the replication speed, but increasing them puts the whole cluster's network at risk. A new configuration was therefore added
to limit the decommissioning DataNode separately from the cluster-wide max-streams limit.

[HDFS-16663](https://issues.apache.org/jira/browse/HDFS-16663) EC: Allow block reconstruction pending timeout refreshable to increase decommission performance.

Following [HDFS-16613](https://issues.apache.org/jira/browse/HDFS-16613), increasing the value of `dfs.namenode.replication.max-streams-hard-limit` maximizes the IO
performance of a decommissioning DataNode that has many EC blocks. Besides this, the value of
`dfs.namenode.reconstruction.pending.timeout-sec` (default 5 minutes) also needs to be decreased, to shorten the interval for checking
pendingReconstructions; otherwise the decommissioning node would sit idle waiting for copy tasks for most of those 5 minutes.
During decommissioning, these two parameters may need to be reconfigured several times. Since [HDFS-14560](https://issues.apache.org/jira/browse/HDFS-14560),
`dfs.namenode.replication.max-streams-hard-limit` can already be reconfigured dynamically without a NameNode restart, and
`dfs.namenode.reconstruction.pending.timeout-sec` is now dynamically reconfigurable as well.
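In practice, applying a runtime change follows the standard `dfsadmin` reconfiguration flow, sketched below (the NameNode host and port are placeholders): update the property in `hdfs-site.xml` on the NameNode, then trigger and monitor the reconfiguration.

```shell
# List the properties the NameNode supports reconfiguring at runtime.
hdfs dfsadmin -reconfig namenode nn01.example.com:8020 properties

# After editing dfs.namenode.replication.max-streams-hard-limit and/or
# dfs.namenode.reconstruction.pending.timeout-sec in hdfs-site.xml,
# start a reconfiguration task and poll its status.
hdfs dfsadmin -reconfig namenode nn01.example.com:8020 start
hdfs dfsadmin -reconfig namenode nn01.example.com:8020 status
```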
**Bug**

[HDFS-16456](https://issues.apache.org/jira/browse/HDFS-16456) EC: Decommission a rack with only one DN will fail when the rack number is equal with replication.

In the scenario below, decommissioning fails with a `TOO_MANY_NODES_ON_RACK` reason:

- An EC policy such as RS-6-3-1024k is enabled.
- The number of racks in the cluster is equal to or less than the replication number (9).
- A rack has only one DN, and that DN is decommissioned.

This issue has been addressed by the resolution of HDFS-16456.

[HDFS-17094](https://issues.apache.org/jira/browse/HDFS-17094) EC: Fix bug in block recovery when there are stale datanodes.

During block recovery, the `RecoveryTaskStriped` in the DataNode expects a one-to-one correspondence between
`rBlock.getLocations()` and `rBlock.getBlockIndices()`. However, if there are stale locations during a NameNode heartbeat,
this correspondence may be disrupted. Specifically, although there are no stale locations in `recoveryLocations`, the block indices
array remains complete. This discrepancy causes `BlockRecoveryWorker.RecoveryTaskStriped#recover` to generate an incorrect
internal block ID, leading to a failure in the recovery process, as the corresponding DataNode cannot locate the replica.
This issue has been addressed by the resolution of HDFS-17094.

[HDFS-17284](https://issues.apache.org/jira/browse/HDFS-17284) EC: Fix int overflow in calculating numEcReplicatedTasks and numReplicationTasks during block recovery.

Due to an integer overflow in the calculation of numReplicationTasks or numEcReplicatedTasks, the NameNode's configuration
parameter `dfs.namenode.replication.max-streams-hard-limit` failed to take effect. This led to an excessive number of tasks
being sent to the DataNodes, consequently occupying too much of their memory.
This issue has been addressed by the resolution of HDFS-17284.

**Others**

For other improvements and fixes for HDFS EC, please refer to the release document.
Transitive CVE fixes
--------------------

A lot of dependencies have been upgraded to address recent CVEs.
Many of the CVEs were not actually exploitable through Hadoop,
so much of this work is simply due diligence.
However, applications which have all the libraries on their classpath may
be vulnerable, and the upgrades should also reduce the number of false
positives security scanners report.

We have not been able to upgrade every single dependency to the latest
version available, as some of those changes are fundamentally incompatible.
If you have concerns about the state of a specific library, consult the Apache JIRA
issue tracker to see if an issue has been filed, discussions have taken place about
the library in question, and whether or not there is already a fix in the pipeline.

*Please don't file new JIRAs about dependency-X.Y.Z having a CVE without
searching for any existing issue first.*

As an open-source project, contributions in this area are always welcome,
especially in testing the active branches, testing applications downstream of
those branches, and checking whether updated dependencies trigger regressions.
Security Advisory
=================

Hadoop HDFS is a distributed filesystem allowing remote
callers to read and write data.

Hadoop YARN is a distributed job submission/execution
engine allowing remote callers to submit arbitrary
work into the cluster.

Unless a Hadoop cluster is deployed with
[caller authentication with Kerberos](./hadoop-project-dist/hadoop-common/SecureMode.html),
anyone with network access to the servers has unrestricted access to the data
and the ability to run whatever code they want in the system.

In production, there are generally three deployment patterns which
can, with care, keep data and computing resources private.

1. Physical cluster: *configure Hadoop security*, usually bonded to the
   enterprise Kerberos/Active Directory systems.
   Good.

2. Cloud: transient or persistent single or multiple user/tenant cluster
   with private VLAN *and security*.
   Good.
   Consider [Apache Knox](https://knox.apache.org/) for managing remote
   access to the cluster.

3. Cloud: transient single user/tenant cluster with private VLAN
   *and no security at all*.
   Requires careful network configuration, as this is the sole
   means of securing the cluster.
   Consider [Apache Knox](https://knox.apache.org/) for managing
   remote access to the cluster.

*If you deploy a Hadoop cluster in-cloud without security, and without configuring a VLAN
to restrict access to trusted users, you are implicitly sharing your data and
computing resources with anyone with network access.*

If you do deploy an insecure cluster this way then port scanners will inevitably
find it and submit crypto-mining jobs. If this happens to you, please do not report
this as a CVE or security issue: it is _utterly predictable_. Secure *your cluster* if
you want it to remain exclusively *your cluster*.

Finally, if you are using Hadoop as a service deployed/managed by someone else,
do determine what security their products offer and make sure it meets your requirements.
Getting Started
===============

The Hadoop documentation includes the information you need to get started using
Hadoop. Begin with the
[Single Node Setup](./hadoop-project-dist/hadoop-common/SingleCluster.html),
which shows you how to set up a single-node Hadoop installation.
Then move on to the
[Cluster Setup](./hadoop-project-dist/hadoop-common/ClusterSetup.html)
to learn how to set up a multi-node Hadoop installation.

Before deploying Hadoop in production, read
[Hadoop in Secure Mode](./hadoop-project-dist/hadoop-common/SecureMode.html),
and follow its instructions to secure your cluster.