123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261 |
- <!---
- Licensed under the Apache License, Version 2.0 (the "License");
- you may not use this file except in compliance with the License.
- You may obtain a copy of the License at
- http://www.apache.org/licenses/LICENSE-2.0
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License. See accompanying LICENSE file.
- -->
- Apache Hadoop ${project.version}
- ================================
- Apache Hadoop ${project.version} is an update to the Hadoop 3.4.x release branch.
- Overview of Changes
- ===================
- Users are encouraged to read the full set of release notes.
- This page provides an overview of the major changes.
- S3A: Upgrade AWS SDK to V2
- ----------------------------------------
- [HADOOP-18073](https://issues.apache.org/jira/browse/HADOOP-18073) S3A: Upgrade AWS SDK to V2
- This release upgrade Hadoop's AWS connector S3A from AWS SDK for Java V1 to AWS SDK for Java V2.
- This is a significant change which offers a number of new features including the ability to work with Amazon S3 Express One Zone Storage - the new high performance, single AZ storage class.
- HDFS DataNode Split one FsDatasetImpl lock to volume grain locks
- ----------------------------------------
- [HDFS-15382](https://issues.apache.org/jira/browse/HDFS-15382) Split one FsDatasetImpl lock to volume grain locks.
- Throughput is one of the core performance evaluation for DataNode instance.
- However, it does not reach the best performance especially for Federation deploy all the time although there are different improvement,
- because of the global coarse-grain lock.
- These series issues (include [HDFS-16534](https://issues.apache.org/jira/browse/HDFS-16534), [HDFS-16511](https://issues.apache.org/jira/browse/HDFS-16511), [HDFS-15382](https://issues.apache.org/jira/browse/HDFS-15382) and [HDFS-16429](https://issues.apache.org/jira/browse/HDFS-16429).)
- try to split the global coarse-grain lock to fine-grain lock which is double level lock for blockpool and volume,
- to improve the throughput and avoid lock impacts between blockpools and volumes.
- YARN Federation improvements
- ----------------------------------------
- [YARN-5597](https://issues.apache.org/jira/browse/YARN-5597) YARN Federation improvements.
- We have enhanced the YARN Federation functionality for improved usability. The enhanced features are as follows:
- 1. YARN Router now boasts a full implementation of all interfaces including the ApplicationClientProtocol, ResourceManagerAdministrationProtocol, and RMWebServiceProtocol.
- 2. YARN Router support for application cleanup and automatic offline mechanisms for subCluster.
- 3. Code improvements were undertaken for the Router and AMRMProxy, along with enhancements to previously pending functionalities.
- 4. Audit logs and Metrics for Router received upgrades.
- 5. A boost in cluster security features was achieved, with the inclusion of Kerberos support.
- 6. The page function of the router has been enhanced.
- 7. A set of commands has been added to the Router side for operating on SubClusters and Policies.
- YARN Capacity Scheduler improvements
- ----------------------------------------
- [YARN-10496](https://issues.apache.org/jira/browse/YARN-10496) Support Flexible Auto Queue Creation in Capacity Scheduler
- Capacity Scheduler resource distribution mode was extended with a new allocation mode called weight mode.
- Defining queue capacities with weights allows the users to use the newly added flexible queue auto creation mode.
- Flexible mode now supports the dynamic creation of both **parent queues** and **leaf queues**, enabling the creation of
- complex queue hierarchies application submission time.
- [YARN-10888](https://issues.apache.org/jira/browse/YARN-10888) New capacity modes for Capacity Scheduler
- Capacity Scheduler's resource distribution was completely refactored to be more flexible and extensible. There is a new concept
- called Capacity Vectors, which allows the users to mix various resource types in the hierarchy, and also in a single queue. With
- this optionally enabled feature it is now possible to define different resources with different units, like memory with GBs, vcores with
- percentage values, and GPUs/FPGAs with weights, all in the same queue.
- [YARN-10889](https://issues.apache.org/jira/browse/YARN-10889) Queue Creation in Capacity Scheduler - Various improvements
- In addition to the two new features above, there were a number of commits for improvements and bug fixes in Capacity Scheduler.
- HDFS RBF: Code Enhancements, New Features, and Bug Fixes
- ----------------------------------------
- The HDFS RBF functionality has undergone significant enhancements, encompassing over 200 commits for feature
- improvements, new functionalities, and bug fixes.
- Important features and improvements are as follows:
- **Feature**
- [HDFS-15294](https://issues.apache.org/jira/browse/HDFS-15294) HDFS Federation balance tool introduces one tool to balance data across different namespace.
- [HDFS-13522](https://issues.apache.org/jira/browse/HDFS-13522), [HDFS-16767](https://issues.apache.org/jira/browse/HDFS-16767) Support observer node from Router-Based Federation.
- **Improvement**
- [HADOOP-13144](https://issues.apache.org/jira/browse/HADOOP-13144), [HDFS-13274](https://issues.apache.org/jira/browse/HDFS-13274), [HDFS-15757](https://issues.apache.org/jira/browse/HDFS-15757)
- These tickets have enhanced IPC throughput between Router and NameNode via multiple connections per user, and optimized connection management.
- [HDFS-14090](https://issues.apache.org/jira/browse/HDFS-14090) RBF: Improved isolation for downstream name nodes. {Static}
- Router supports assignment of the dedicated number of RPC handlers to achieve isolation for all downstream nameservices
- it is configured to proxy. Since large or busy clusters may have relatively higher RPC traffic to the namenode compared to other clusters namenodes,
- this feature if enabled allows admins to configure higher number of RPC handlers for busy clusters.
- [HDFS-17128](https://issues.apache.org/jira/browse/HDFS-17128) RBF: SQLDelegationTokenSecretManager should use version of tokens updated by other routers.
- The SQLDelegationTokenSecretManager enhances performance by maintaining processed tokens in memory. However, there is
- a potential issue of router cache inconsistency due to token loading and renewal. This issue has been addressed by the
- resolution of HDFS-17128.
- [HDFS-17148](https://issues.apache.org/jira/browse/HDFS-17148) RBF: SQLDelegationTokenSecretManager must cleanup expired tokens in SQL.
- SQLDelegationTokenSecretManager, while fetching and temporarily storing tokens from SQL in a memory cache with a short TTL,
- faces an issue where expired tokens are not efficiently cleaned up, leading to a buildup of expired tokens in the SQL database.
- This issue has been addressed by the resolution of HDFS-17148.
- **Others**
- Other changes to HDFS RBF include WebUI, command line, and other improvements. Please refer to the release document.
- HDFS EC: Code Enhancements and Bug Fixes
- ----------------------------------------
- HDFS EC has made code improvements and fixed some bugs.
- Important improvements and bugs are as follows:
- **Improvement**
- [HDFS-16613](https://issues.apache.org/jira/browse/HDFS-16613) EC: Improve performance of decommissioning dn with many ec blocks.
- In a hdfs cluster with a lot of EC blocks, decommission a dn is very slow. The reason is unlike replication blocks can be replicated
- from any dn which has the same block replication, the ec block have to be replicated from the decommissioning dn.
- The configurations `dfs.namenode.replication.max-streams` and `dfs.namenode.replication.max-streams-hard-limit` will limit
- the replication speed, but increase these configurations will create risk to the whole cluster's network. So it should add a new
- configuration to limit the decommissioning dn, distinguished from the cluster wide max-streams limit.
- [HDFS-16663](https://issues.apache.org/jira/browse/HDFS-16663) EC: Allow block reconstruction pending timeout refreshable to increase decommission performance.
- In [HDFS-16613](https://issues.apache.org/jira/browse/HDFS-16613), increase the value of `dfs.namenode.replication.max-streams-hard-limit` would maximize the IO
- performance of the decommissioning DN, which has a lot of EC blocks. Besides this, we also need to decrease the value of
- `dfs.namenode.reconstruction.pending.timeout-sec`, default is 5 minutes, to shorten the interval time for checking
- pendingReconstructions. Or the decommissioning node would be idle to wait for copy tasks in most of this 5 minutes.
- In decommission progress, we may need to reconfigure these 2 parameters several times. In [HDFS-14560](https://issues.apache.org/jira/browse/HDFS-14560), the
- `dfs.namenode.replication.max-streams-hard-limit` can already be reconfigured dynamically without namenode restart. And
- the `dfs.namenode.reconstruction.pending.timeout-sec` parameter also need to be reconfigured dynamically.
- **Bug**
- [HDFS-16456](https://issues.apache.org/jira/browse/HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication.
- In below scenario, decommission will fail by `TOO_MANY_NODES_ON_RACK` reason:
- - Enable EC policy, such as RS-6-3-1024k.
- - The rack number in this cluster is equal with or less than the replication number(9)
- - A rack only has one DN, and decommission this DN.
- This issue has been addressed by the resolution of HDFS-16456.
- [HDFS-17094](https://issues.apache.org/jira/browse/HDFS-17094) EC: Fix bug in block recovery when there are stale datanodes.
- During block recovery, the `RecoveryTaskStriped` in the datanode expects a one-to-one correspondence between
- `rBlock.getLocations()` and `rBlock.getBlockIndices()`. However, if there are stale locations during a NameNode heartbeat,
- this correspondence may be disrupted. Specifically, although there are no stale locations in `recoveryLocations`, the block indices
- array remains complete. This discrepancy causes `BlockRecoveryWorker.RecoveryTaskStriped#recover` to generate an incorrect
- internal block ID, leading to a failure in the recovery process as the corresponding datanode cannot locate the replica.
- This issue has been addressed by the resolution of HDFS-17094.
- [HDFS-17284](https://issues.apache.org/jira/browse/HDFS-17284). EC: Fix int overflow in calculating numEcReplicatedTasks and numReplicationTasks during block recovery.
- Due to an integer overflow in the calculation of numReplicationTasks or numEcReplicatedTasks, the NameNode's configuration
- parameter `dfs.namenode.replication.max-streams-hard-limit` failed to take effect. This led to an excessive number of tasks
- being sent to the DataNodes, consequently occupying too much of their memory.
- This issue has been addressed by the resolution of HDFS-17284.
- **Others**
- Other improvements and fixes for HDFS EC, Please refer to the release document.
- Transitive CVE fixes
- --------------------
- A lot of dependencies have been upgraded to address recent CVEs.
- Many of the CVEs were not actually exploitable through the Hadoop
- so much of this work is just due diligence.
- However, applications which have all the library is on a class path may
- be vulnerable, and the upgrades should also reduce the number of false
- positives security scanners report.
- We have not been able to upgrade every single dependency to the latest
- version there is. Some of those changes are fundamentally incompatible.
- If you have concerns about the state of a specific library, consult the Apache JIRA
- issue tracker to see if an issue has been filed, discussions have taken place about
- the library in question, and whether or not there is already a fix in the pipeline.
- *Please don't file new JIRAs about dependency-X.Y.Z having a CVE without
- searching for any existing issue first*
- As an open-source project, contributions in this area are always welcome,
- especially in testing the active branches, testing applications downstream of
- those branches and of whether updated dependencies trigger regressions.
- Security Advisory
- =================
- Hadoop HDFS is a distributed filesystem allowing remote
- callers to read and write data.
- Hadoop YARN is a distributed job submission/execution
- engine allowing remote callers to submit arbitrary
- work into the cluster.
- Unless a Hadoop cluster is deployed with
- [caller authentication with Kerberos](./hadoop-project-dist/hadoop-common/SecureMode.html),
- anyone with network access to the servers has unrestricted access to the data
- and the ability to run whatever code they want in the system.
- In production, there are generally three deployment patterns which
- can, with care, keep data and computing resources private.
- 1. Physical cluster: *configure Hadoop security*, usually bonded to the
- enterprise Kerberos/Active Directory systems.
- Good.
- 2. Cloud: transient or persistent single or multiple user/tenant cluster
- with private VLAN *and security*.
- Good.
- Consider [Apache Knox](https://knox.apache.org/) for managing remote
- access to the cluster.
- 3. Cloud: transient single user/tenant cluster with private VLAN
- *and no security at all*.
- Requires careful network configuration as this is the sole
- means of securing the cluster..
- Consider [Apache Knox](https://knox.apache.org/) for managing
- remote access to the cluster.
- *If you deploy a Hadoop cluster in-cloud without security, and without configuring a VLAN
- to restrict access to trusted users, you are implicitly sharing your data and
- computing resources with anyone with network access*
- If you do deploy an insecure cluster this way then port scanners will inevitably
- find it and submit crypto-mining jobs. If this happens to you, please do not report
- this as a CVE or security issue: it is _utterly predictable_. Secure *your cluster* if
- you want to remain exclusively *your cluster*.
- Finally, if you are using Hadoop as a service deployed/managed by someone else,
- do determine what security their products offer and make sure it meets your requirements.
- Getting Started
- ===============
- The Hadoop documentation includes the information you need to get started using
- Hadoop. Begin with the
- [Single Node Setup](./hadoop-project-dist/hadoop-common/SingleCluster.html)
- which shows you how to set up a single-node Hadoop installation.
- Then move on to the
- [Cluster Setup](./hadoop-project-dist/hadoop-common/ClusterSetup.html)
- to learn how to set up a multi-node Hadoop installation.
- Before deploying Hadoop in production, read
- [Hadoop in Secure Mode](./hadoop-project-dist/hadoop-common/SecureMode.html),
- and follow its instructions to secure your cluster.
|