<!---
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

Apache Hadoop ${project.version}
================================

Apache Hadoop ${project.version} incorporates a number of significant
enhancements over the previous major release line (hadoop-2.x).

This release is generally available (GA), meaning that it represents a point of
API stability and quality that we consider production-ready.

Overview
========

Users are encouraged to read the full set of release notes.
This page provides an overview of the major changes.

Minimum required Java version increased from Java 7 to Java 8
------------------

All Hadoop JARs are now compiled targeting a runtime version of Java 8.
Users still using Java 7 or below must upgrade to Java 8.

Support for erasure coding in HDFS
------------------

Erasure coding is a method for durably storing data with significant space
savings compared to replication. Standard encodings like Reed-Solomon (10,4)
have a 1.4x space overhead, compared to the 3x overhead of standard HDFS
replication.

Since erasure coding imposes additional overhead during reconstruction
and performs mostly remote reads, it has traditionally been used for
storing colder, less frequently accessed data. Users should consider
the network and CPU overheads of erasure coding when deploying this
feature.

More details are available in the
[HDFS Erasure Coding](./hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html)
documentation.
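As a concrete illustration of the overhead: a Reed-Solomon (10,4) file is split
into 10 data blocks plus 4 parity blocks, so it occupies 14/10 = 1.4x its
logical size, versus 3x under triple replication. The sketch below is not taken
from the erasure coding documentation; it assumes the name-based
`setErasureCodingPolicy(Path, String)` overload on `DistributedFileSystem`, and
the directory path and the assumption that the `RS-10-4-1024k` policy has been
enabled on the cluster are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class ApplyEcPolicy {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    if (fs instanceof DistributedFileSystem) {
      DistributedFileSystem dfs = (DistributedFileSystem) fs;
      // New files under /archive are written as 10 data + 4 parity blocks
      // (1.4x raw space) instead of three full replicas (3x raw space).
      // Assumes the RS-10-4-1024k policy has been enabled by the admin.
      dfs.setErasureCodingPolicy(new Path("/archive"), "RS-10-4-1024k");
    }
  }
}
```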
YARN Timeline Service v.2
-------------------

We are introducing an early preview (alpha 2) of a major revision of YARN
Timeline Service: v.2. YARN Timeline Service v.2 addresses two major
challenges: improving scalability and reliability of Timeline Service, and
enhancing usability by introducing flows and aggregation.

YARN Timeline Service v.2 alpha 2 is provided so that users and developers
can test it and provide feedback and suggestions for making it a ready
replacement for Timeline Service v.1.x. It should be used only in a test
capacity.

More details are available in the
[YARN Timeline Service v.2](./hadoop-yarn/hadoop-yarn-site/TimelineServiceV2.html)
documentation.

Shell script rewrite
-------------------

The Hadoop shell scripts have been rewritten to fix many long-standing
bugs and include some new features. While an eye has been kept towards
compatibility, some changes may break existing installations.

Incompatible changes are documented in the release notes, with related
discussion on [HADOOP-9902](https://issues.apache.org/jira/browse/HADOOP-9902).

More details are available in the
[Unix Shell Guide](./hadoop-project-dist/hadoop-common/UnixShellGuide.html)
documentation. Power users will also be pleased by the
[Unix Shell API](./hadoop-project-dist/hadoop-common/UnixShellAPI.html)
documentation, which describes much of the new functionality, particularly
related to extensibility.

Shaded client jars
------------------

The `hadoop-client` Maven artifact available in 2.x releases pulls
Hadoop's transitive dependencies onto a Hadoop application's classpath.
This can be problematic if the versions of these transitive dependencies
conflict with the versions used by the application.

[HADOOP-11804](https://issues.apache.org/jira/browse/HADOOP-11804) adds
new `hadoop-client-api` and `hadoop-client-runtime` artifacts that
shade Hadoop's dependencies into a single jar. This avoids leaking
Hadoop's dependencies onto the application's classpath.
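As a minimal sketch of how an application might use the shaded artifacts:
compile against `hadoop-client-api`, add `hadoop-client-runtime` to the runtime
classpath, and code only against Hadoop's public API. The class below is
illustrative; it uses only public Hadoop classes, so none of Hadoop's own
third-party dependencies need to appear on the application's classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Compile-time dependency: hadoop-client-api (public API classes only).
// Runtime dependency: hadoop-client-runtime (shaded implementation).
public class ListDir {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    for (FileStatus status : fs.listStatus(new Path(args[0]))) {
      System.out.println(status.getPath());
    }
  }
}
```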
Support for Opportunistic Containers and Distributed Scheduling
--------------------

A notion of `ExecutionType` has been introduced, whereby applications can
now request containers with an execution type of `Opportunistic`.
Containers of this type can be dispatched for execution at an NM even if
there are no resources available at the moment of scheduling. In such a
case, these containers will be queued at the NM, waiting for resources to
become available before they start. Opportunistic containers are of lower
priority than the default `Guaranteed` containers and are therefore preempted,
if needed, to make room for Guaranteed containers. This should
improve cluster utilization.

Opportunistic containers are by default allocated by the central RM, but
support has also been added to allow opportunistic containers to be
allocated by a distributed scheduler which is implemented as an
AMRMProtocol interceptor.

Please see the [documentation](./hadoop-yarn/hadoop-yarn-site/OpportunisticContainers.html)
for more details.
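A hypothetical sketch of the application-master side, assuming the
`AMRMClient.ContainerRequest` constructor that accepts an
`ExecutionTypeRequest`; the resource size, priority, and allocation-request id
are illustrative:

```java
import org.apache.hadoop.yarn.api.records.ExecutionType;
import org.apache.hadoop.yarn.api.records.ExecutionTypeRequest;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class OpportunisticRequest {
  static void addRequest(AMRMClient<ContainerRequest> amRmClient) {
    // Ask for a 1 GB / 1 vcore container that may be queued at an NM
    // instead of waiting for guaranteed capacity.
    ContainerRequest request = new ContainerRequest(
        Resource.newInstance(1024, 1),   // capability
        null, null,                      // no node/rack constraints
        Priority.newInstance(1),
        0L,                              // allocationRequestId (illustrative)
        true,                            // relaxLocality
        null,                            // no node-label expression
        ExecutionTypeRequest.newInstance(ExecutionType.OPPORTUNISTIC, true));
    amRmClient.addContainerRequest(request);
  }
}
```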
MapReduce task-level native optimization
--------------------

MapReduce has added support for a native implementation of the map output
collector. For shuffle-intensive jobs, this can lead to a performance
improvement of 30% or more.

See the release notes for
[MAPREDUCE-2841](https://issues.apache.org/jira/browse/MAPREDUCE-2841)
for more detail.
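As a sketch of how a job might opt in, the property below selects the native
collector implementation described in MAPREDUCE-2841; the surrounding job setup
is illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class NativeCollectorJob {
  public static Job createJob() throws Exception {
    Configuration conf = new Configuration();
    // Swap the default map output collector for the native implementation.
    conf.set("mapreduce.job.map.output.collector.class",
        "org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator");
    return Job.getInstance(conf, "native-collector-example");
  }
}
```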
Support for more than 2 NameNodes
--------------------

The initial implementation of HDFS NameNode high-availability provided
for a single active NameNode and a single Standby NameNode. By replicating
edits to a quorum of three JournalNodes, this architecture is able to
tolerate the failure of any one node in the system.

However, some deployments require higher degrees of fault-tolerance.
This is enabled by this new feature, which allows users to run multiple
standby NameNodes. For instance, by configuring three NameNodes and
five JournalNodes, the cluster is able to tolerate the failure of two
nodes rather than just one.

The [HDFS high-availability documentation](./hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html)
has been updated with instructions on how to configure more than two
NameNodes.
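A hypothetical `Configuration` sketch of a three-NameNode nameservice. The
nameservice id `mycluster`, the NameNode ids, and all hostnames and ports are
made up; in practice these properties live in `hdfs-site.xml`:

```java
import org.apache.hadoop.conf.Configuration;

public class ThreeNameNodeConfig {
  public static Configuration build() {
    Configuration conf = new Configuration();
    conf.set("dfs.nameservices", "mycluster");
    // Three NameNodes instead of the former limit of two.
    conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2,nn3");
    conf.set("dfs.namenode.rpc-address.mycluster.nn1", "host1.example.com:8020");
    conf.set("dfs.namenode.rpc-address.mycluster.nn2", "host2.example.com:8020");
    conf.set("dfs.namenode.rpc-address.mycluster.nn3", "host3.example.com:8020");
    // Five JournalNodes tolerate the loss of any two nodes.
    conf.set("dfs.namenode.shared.edits.dir",
        "qjournal://jn1.example.com:8485;jn2.example.com:8485;"
            + "jn3.example.com:8485;jn4.example.com:8485;"
            + "jn5.example.com:8485/mycluster");
    return conf;
  }
}
```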
Default ports of multiple services have been changed
------------------------

Previously, the default ports of multiple Hadoop services were in the
Linux ephemeral port range (32768-61000). This meant that at startup,
services would sometimes fail to bind to the port due to a conflict
with another application.

These conflicting ports have been moved out of the ephemeral range,
affecting the NameNode, Secondary NameNode, DataNode, and KMS. Our
documentation has been updated appropriately, but see the release
notes for [HDFS-9427](https://issues.apache.org/jira/browse/HDFS-9427) and
[HADOOP-12811](https://issues.apache.org/jira/browse/HADOOP-12811)
for a list of port changes.

Support for Microsoft Azure Data Lake and Aliyun Object Storage System filesystem connectors
---------------------

Hadoop now supports integration with Microsoft Azure Data Lake and
Aliyun Object Storage System as alternative Hadoop-compatible filesystems.
Intra-datanode balancer
-------------------

A single DataNode manages multiple disks. During normal write operation,
disks will be filled up evenly. However, adding or replacing disks can
lead to significant skew within a DataNode. This situation is not handled
by the existing HDFS balancer, which concerns itself with inter-, not intra-,
DN skew.

This situation is handled by the new intra-DataNode balancing
functionality, which is invoked via the `hdfs diskbalancer` CLI.

See the disk balancer section in the
[HDFS Commands Guide](./hadoop-project-dist/hadoop-hdfs/HDFSCommands.html)
for more information.
Reworked daemon and task heap management
---------------------

A series of changes have been made to heap management for Hadoop daemons
as well as MapReduce tasks.

[HADOOP-10950](https://issues.apache.org/jira/browse/HADOOP-10950) introduces
new methods for configuring daemon heap sizes.
Notably, auto-tuning is now possible based on the memory size of the host,
and the `HADOOP_HEAPSIZE` variable has been deprecated.
See the full release notes of HADOOP-10950 for more detail.

[MAPREDUCE-5785](https://issues.apache.org/jira/browse/MAPREDUCE-5785)
simplifies the configuration of map and reduce task
heap sizes, so the desired heap size no longer needs to be specified
in both the task configuration and as a Java option.
Existing configs that already specify both are not affected by this change.
See the full release notes of MAPREDUCE-5785 for more details.
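For example, under the MAPREDUCE-5785 behavior a job can size its task
containers once and let the JVM heap be derived from them. A sketch, assuming
the default heap-to-container ratio applies and with illustrative values:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class HeapSizingExample {
  public static Job createJob() throws Exception {
    Configuration conf = new Configuration();
    // Only the container sizes are specified; the tasks' -Xmx is derived
    // from them, so mapreduce.map.java.opts / mapreduce.reduce.java.opts
    // no longer have to repeat the heap size.
    conf.setInt("mapreduce.map.memory.mb", 2048);
    conf.setInt("mapreduce.reduce.memory.mb", 4096);
    return Job.getInstance(conf, "heap-sizing-example");
  }
}
```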
S3Guard: Consistency and Metadata Caching for the S3A filesystem client
---------------------

[HADOOP-13345](https://issues.apache.org/jira/browse/HADOOP-13345) adds an
optional feature to the S3A client of Amazon S3 storage: the ability to use
a DynamoDB table as a fast and consistent store of file and directory
metadata.

See [S3Guard](./hadoop-aws/tools/hadoop-aws/s3guard.html) for more details.
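A minimal sketch of turning the feature on for an S3A client, assuming the
DynamoDB metadata store described in the S3Guard guide; the bucket and table
names are placeholders, and these properties are normally set in
`core-site.xml`:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class S3GuardClient {
  public static FileSystem open() throws Exception {
    Configuration conf = new Configuration();
    // Store file and directory metadata in DynamoDB for consistent listings.
    conf.set("fs.s3a.metadatastore.impl",
        "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore");
    conf.set("fs.s3a.s3guard.ddb.table", "my-s3guard-table");  // placeholder
    conf.setBoolean("fs.s3a.s3guard.ddb.table.create", true);
    return FileSystem.get(URI.create("s3a://my-bucket/"), conf);
  }
}
```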
HDFS Router-Based Federation
---------------------

HDFS Router-Based Federation adds an RPC routing layer that provides a federated
view of multiple HDFS namespaces. This is similar to the existing
[ViewFs](./hadoop-project-dist/hadoop-hdfs/ViewFs.html) and
[HDFS Federation](./hadoop-project-dist/hadoop-hdfs/Federation.html)
functionality, except the mount table is managed on the server side by the
routing layer rather than on the client. This simplifies access to a federated
cluster for existing HDFS clients.

See [HDFS-10467](https://issues.apache.org/jira/browse/HDFS-10467) and the
HDFS Router-based Federation
[documentation](./hadoop-project-dist/hadoop-hdfs-rbf/HDFSRouterFederation.html) for
more details.
API-based configuration of Capacity Scheduler queue configuration
----------------------

The OrgQueue extension to the capacity scheduler provides a programmatic way to
change configurations by providing a REST API that users can call to modify
queue configurations. This enables administrators listed in a queue's
`administer_queue` ACL to automate queue configuration management.

See [YARN-5734](https://issues.apache.org/jira/browse/YARN-5734) and the
[Capacity Scheduler documentation](./hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html) for more information.
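A hypothetical sketch of calling such a REST endpoint from Java. The
`/ws/v1/cluster/scheduler-conf` path and the shape of the XML payload are
assumptions based on the Capacity Scheduler documentation; the RM address,
queue name, and capacity value are placeholders.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class UpdateQueueCapacity {
  public static void main(String[] args) throws Exception {
    // Assumed ResourceManager endpoint for scheduler configuration updates.
    URL url = new URL("http://rm.example.com:8088/ws/v1/cluster/scheduler-conf");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("PUT");
    conn.setRequestProperty("Content-Type", "application/xml");
    conn.setDoOutput(true);

    // Assumed payload shape: update one property on root.default.
    String body =
        "<sched-conf>"
      + "<update-queue><queue-name>root.default</queue-name>"
      + "<params><entry><key>capacity</key><value>50</value></entry></params>"
      + "</update-queue>"
      + "</sched-conf>";
    try (OutputStream out = conn.getOutputStream()) {
      out.write(body.getBytes(StandardCharsets.UTF_8));
    }
    System.out.println("HTTP " + conn.getResponseCode());
  }
}
```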
YARN Resource Types
---------------

The YARN resource model has been generalized to support user-defined countable
resource types beyond CPU and memory. For instance, the cluster administrator
could define resources like GPUs, software licenses, or locally-attached
storage. YARN tasks can then be scheduled based on the availability of these
resources.

See [YARN-3926](https://issues.apache.org/jira/browse/YARN-3926) and the
[YARN resource model documentation](./hadoop-yarn/hadoop-yarn-site/ResourceModel.html)
for more information.
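A hypothetical sketch of requesting a custom resource from application code,
assuming the administrator has defined a countable resource named `gpu` (via
`yarn.resource-types`) and that the per-type `Resource#setResourceValue`
accessor is available in this release; all values are illustrative:

```java
import org.apache.hadoop.yarn.api.records.Resource;

public class GpuRequest {
  static Resource buildCapability() {
    // 4 GB, 2 vcores, plus one unit of the admin-defined "gpu" resource.
    Resource capability = Resource.newInstance(4096, 2);
    capability.setResourceValue("gpu", 1);
    return capability;
  }
}
```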
Getting Started
===============

The Hadoop documentation includes the information you need to get started using
Hadoop. Begin with the
[Single Node Setup](./hadoop-project-dist/hadoop-common/SingleCluster.html)
which shows you how to set up a single-node Hadoop installation.

Then move on to the
[Cluster Setup](./hadoop-project-dist/hadoop-common/ClusterSetup.html)
to learn how to set up a multi-node Hadoop installation.