@@ -14,6 +14,10 @@
 
 # Hadoop-AWS module: Integration with Amazon Web Services
 
+<!-- MACRO{toc|fromDepth=0|toDepth=5} -->
+
+## Overview
+
 The `hadoop-aws` module provides support for AWS integration. The generated
 JAR file, `hadoop-aws.jar` also declares a transitive dependency on all
 external artifacts which are needed for this support —enabling downstream
@@ -22,18 +26,19 @@ applications to easily use this support.
 To make it part of Apache Hadoop's default classpath, simply make sure that
 HADOOP_OPTIONAL_TOOLS in hadoop-env.sh has 'hadoop-aws' in the list.
 
-Features
+### Features
 
-1. The "classic" `s3:` filesystem for storing objects in Amazon S3 Storage
+1. The "classic" `s3:` filesystem for storing objects in Amazon S3 Storage.
+**NOTE: `s3:` is being phased out. Use `s3n:` or `s3a:` instead.**
 1. The second-generation, `s3n:` filesystem, making it easy to share
-data between hadoop and other applications via the S3 object store
+data between hadoop and other applications via the S3 object store.
 1. The third generation, `s3a:` filesystem. Designed to be a switch in
 replacement for `s3n:`, this filesystem binding supports larger files and promises
 higher performance.
 
 The specifics of using these filesystems are documented below.
 
-## Warning #1: Object Stores are not filesystems.
+### Warning #1: Object Stores are not filesystems.
 
 Amazon S3 is an example of "an object store". In order to achieve scalability
 and especially high availability, S3 has —as many other cloud object stores have
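
As a note on the classpath step in this hunk: enabling the module amounts to a single line in `etc/hadoop/hadoop-env.sh`. A minimal sketch (real deployments may list additional optional tools in the comma-separated value):

```shell
# hadoop-env.sh: add hadoop-aws to the comma-separated list of optional tools
# so its JAR (and its transitive dependencies) join the default classpath.
export HADOOP_OPTIONAL_TOOLS="hadoop-aws"

# Sanity check that the list actually contains hadoop-aws:
case ",${HADOOP_OPTIONAL_TOOLS}," in
  *,hadoop-aws,*) echo "hadoop-aws enabled" ;;
  *)              echo "hadoop-aws missing" ;;
esac
```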
@@ -50,14 +55,14 @@ recursive file-by-file operations. They take time at least proportional to
 the number of files, during which time partial updates may be visible. If
 the operations are interrupted, the filesystem is left in an intermediate state.
 
-## Warning #2: Because Object stores don't track modification times of directories,
+### Warning #2: Because Object stores don't track modification times of directories,
 features of Hadoop relying on this can have unexpected behaviour. E.g. the
 AggregatedLogDeletionService of YARN will not remove the appropriate logfiles.
 
 For further discussion on these topics, please consult
 [The Hadoop FileSystem API Definition](../../../hadoop-project-dist/hadoop-common/filesystem/index.html).
 
-## Warning #3: your AWS credentials are valuable
+### Warning #3: your AWS credentials are valuable
 
 Your AWS credentials not only pay for services, they offer read and write
 access to the data. Anyone with the credentials can not only read your datasets
@@ -101,6 +106,29 @@ If you do any of these: change your credentials immediately!
 
 ### Other properties
 
+    <property>
+      <name>fs.s3.buffer.dir</name>
+      <value>${hadoop.tmp.dir}/s3</value>
+      <description>Determines where on the local filesystem the s3:/s3n: filesystem
+      should store files before sending them to S3
+      (or after retrieving them from S3).
+      </description>
+    </property>
+
+    <property>
+      <name>fs.s3.maxRetries</name>
+      <value>4</value>
+      <description>The maximum number of retries for reading or writing files to
+      S3, before we signal failure to the application.
+      </description>
+    </property>
+
+    <property>
+      <name>fs.s3.sleepTimeSeconds</name>
+      <value>10</value>
+      <description>The number of seconds to sleep between each S3 retry.
+      </description>
+    </property>
+
     <property>
       <name>fs.s3n.block.size</name>
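
For illustration, the three `fs.s3.*` properties documented in this hunk would be overridden together in `core-site.xml`. The values below are examples only, not recommendations:

```xml
<configuration>
  <property>
    <name>fs.s3.maxRetries</name>
    <value>6</value> <!-- example: retry harder than the default of 4 -->
  </property>
  <property>
    <name>fs.s3.sleepTimeSeconds</name>
    <value>5</value> <!-- example: shorter pause than the default of 10 -->
  </property>
</configuration>
```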
@@ -138,7 +166,7 @@ If you do any of these: change your credentials immediately!
       <name>fs.s3n.server-side-encryption-algorithm</name>
       <value></value>
       <description>Specify a server-side encryption algorithm for S3.
-      The default is NULL, and the only other currently allowable value is AES256.
+      Unset by default, and the only other currently allowable value is AES256.
       </description>
     </property>
 
@@ -358,6 +386,13 @@ this capability.
       implementations can still be used</description>
     </property>
 
+    <property>
+      <name>fs.s3a.server-side-encryption-algorithm</name>
+      <description>Specify a server-side encryption algorithm for s3a: file system.
+      Unset by default, and the only other currently allowable value is AES256.
+      </description>
+    </property>
+
     <property>
       <name>fs.s3a.buffer.dir</name>
       <value>${hadoop.tmp.dir}/s3a</value>
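
To illustrate the new s3a property in this hunk: enabling server-side encryption takes one entry in `core-site.xml`, with AES256 as the only accepted value at the time of this change:

```xml
<property>
  <name>fs.s3a.server-side-encryption-algorithm</name>
  <value>AES256</value>
</property>
```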
@@ -365,6 +400,13 @@ this capability.
       uploads to. No effect if fs.s3a.fast.upload is true.</description>
     </property>
 
+    <property>
+      <name>fs.s3a.block.size</name>
+      <value>33554432</value>
+      <description>Block size to use when reading files using s3a: file system.
+      </description>
+    </property>
+
     <property>
       <name>fs.s3a.impl</name>
       <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
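
As an example of tuning the new `fs.s3a.block.size` setting: a workload that prefers 64 MB input splits could double the 33554432-byte (32 MB) default. The value here is illustrative:

```xml
<property>
  <name>fs.s3a.block.size</name>
  <value>67108864</value> <!-- 64 MB; the default is 33554432 (32 MB) -->
</property>
```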
@@ -406,7 +448,7 @@ settings could cause memory overflow. Up to `fs.s3a.threads.max` parallel
 (part)uploads are active. Furthermore, up to `fs.s3a.max.total.tasks`
 additional part(uploads) can be waiting (and thus memory buffers are created).
 The memory buffer is uploaded as a single upload if it is not larger than
-`fs.s3a.multipart.threshold`. Else, a multi-part upload is initiatated and
+`fs.s3a.multipart.threshold`. Else, a multi-part upload is initiated and
 parts of size `fs.s3a.multipart.size` are used to protect against overflowing
 the available memory. These settings should be tuned to the envisioned
 workflow (some large files, many small ones, ...) and the physical
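
The single-versus-multipart decision described in this hunk can be sketched as follows. The byte values are illustrative stand-ins for `fs.s3a.multipart.threshold` and `fs.s3a.multipart.size`, not the real defaults:

```shell
# Illustrative stand-in values only; the real settings come from configuration.
threshold=$((128 * 1024 * 1024))   # pretend fs.s3a.multipart.threshold = 128 MB
partsize=$((100 * 1024 * 1024))    # pretend fs.s3a.multipart.size = 100 MB
filesize=$((300 * 1024 * 1024))    # size of the buffered data: 300 MB

if [ "$filesize" -le "$threshold" ]; then
  echo "single upload"
else
  # Round up: how many parts of partsize bytes are needed to cover filesize.
  parts=$(( (filesize + partsize - 1) / partsize ))
  echo "multipart upload in $parts parts"
fi
```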
@@ -506,7 +548,7 @@ Example:
     </property>
     </configuration>
 
-## File `contract-test-options.xml`
+### File `contract-test-options.xml`
 
 The file `hadoop-tools/hadoop-aws/src/test/resources/contract-test-options.xml`
 must be created and configured for the test filesystems.
@@ -518,7 +560,7 @@ The standard S3 authentication details must also be provided. This can be
 through copy-and-paste of the `auth-keys.xml` credentials, or it can be
 through direct XInclude inclusion.
 
-#### s3://
+### s3://
 
 The filesystem name must be defined in the property `fs.contract.test.fs.s3`.
 
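
A sketch of the `fs.contract.test.fs.s3` entry in `contract-test-options.xml`; the bucket name is a placeholder to replace with a bucket you own:

```xml
<property>
  <name>fs.contract.test.fs.s3</name>
  <value>s3://your-test-bucket</value>
</property>
```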