@@ -20,6 +20,27 @@ This document covers the architecture and implementation details of the S3A comm

For information on using the committers, see [the S3A Committers](./committer.html).

+### January 2021 Update
+
+Now that S3 is fully consistent, problems related to inconsistent
+directory listings have gone. However, the rename problem remains: committing
+work by renaming directories is unsafe as well as horribly slow.
+
+This architecture document, and the committers, were written at a time
+when S3 was inconsistent. The two committers addressed this problem differently:
+
+* Staging Committer: relied on a cluster HDFS filesystem for safely propagating
+  the lists of files to commit from workers to the job manager/driver.
+* Magic Committer: required S3Guard to offer consistent directory listings
+  on the object store.
+
+With consistent S3, the Magic Committer can be safely used with any S3 bucket.
+The choice of which to use, then, is a matter for experimentation.
+
+This architecture document was written in 2017, a time when S3 was only
+consistent when an extra consistency layer such as S3Guard was used.
+The document indicates where requirements/constraints which existed then
+are now obsolete.

## Problem: Efficient, reliable commits of work to consistent S3 buckets

@@ -49,10 +70,10 @@ can be executed server-side, but as it does not complete until the in-cluster
copy has completed, it takes time proportional to the amount of data.

The rename overhead is the most visible issue, but it is not the most dangerous.
-That is the fact that path listings have no consistency guarantees, and may
-lag the addition or deletion of files.
-If files are not listed, the commit operation will *not* copy them, and
-so they will not appear in the final output.
+That is the fact that until late 2020, path listings had no consistency guarantees,
+and may have lagged the addition or deletion of files.
+If files were not listed, the commit operation would *not* copy them, and
+so they would not appear in the final output.

The solution to this problem is closely coupled to the S3 protocol itself:
delayed completion of multi-part PUT operations
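The delayed-completion property of multipart PUT that the committers build on can be modelled with a short, self-contained sketch. This is purely illustrative (a hypothetical in-memory store, not the S3A client, the AWS SDK, or part of this patch): parts may be uploaded at any time, but nothing becomes visible until the upload is completed, and an abort discards everything.

```python
# Illustrative in-memory model of the S3 multipart-upload property the
# committers exploit: uploaded parts stay invisible until the upload is
# completed, and completion is a single, final operation.

class MockStore:
    def __init__(self):
        self.objects = {}   # visible objects: path -> bytes
        self.uploads = {}   # upload id -> (path, {part number: bytes})
        self._next_id = 0

    def initiate_multipart(self, path):
        self._next_id += 1
        self.uploads[self._next_id] = (path, {})
        return self._next_id

    def upload_part(self, upload_id, part_number, data):
        # Data is stored against the upload, not at the destination path.
        self.uploads[upload_id][1][part_number] = data

    def complete_multipart(self, upload_id):
        # Only now does the object appear: this is the "commit".
        path, parts = self.uploads.pop(upload_id)
        self.objects[path] = b"".join(parts[n] for n in sorted(parts))

    def abort_multipart(self, upload_id):
        # Discards all uploaded parts; nothing ever becomes visible.
        self.uploads.pop(upload_id)


store = MockStore()
uid = store.initiate_multipart("dest/part-0000")
store.upload_part(uid, 1, b"hello ")
store.upload_part(uid, 2, b"world")
assert "dest/part-0000" not in store.objects  # invisible until completed
store.complete_multipart(uid)
assert store.objects["dest/part-0000"] == b"hello world"
```

A committer can therefore upload all of a task's output eagerly, yet defer the completion calls until job commit.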
@@ -828,6 +849,8 @@ commit sequence in `Task.done()`, when `talkToAMTGetPermissionToCommit()`

# Requirements of an S3A Committer

+The design requirements of the S3A committer were:
+
1. Support an eventually consistent S3 object store as a reliable direct
destination of work through the S3A filesystem client.
1. Efficient: implies no rename, and a minimal amount of delay in the job driver's
@@ -841,6 +864,7 @@ the job, and any previous incomplete jobs.
1. Security: not to permit privilege escalation from other users with
write access to the same file system(s).

+

## Features of S3 and the S3A Client
@@ -852,8 +876,8 @@ MR committer algorithms have significant performance problems.

1. Single-object renames are implemented as a copy and delete sequence.
1. COPY is atomic, but overwrites cannot be prevented.
-1. Amazon S3 is eventually consistent on listings, deletes and updates.
-1. Amazon S3 has create consistency, however, the negative response of a HEAD/GET
+1. [Obsolete] Amazon S3 is eventually consistent on listings, deletes and updates.
+1. [Obsolete] Amazon S3 has create consistency; however, the negative response of a HEAD/GET
performed on a path before an object was created can be cached, unintentionally
creating a create inconsistency. The S3A client library does perform such a check,
on `create()` and `rename()` to check the state of the destination path, and
@@ -872,11 +896,12 @@ data, with the `S3ABlockOutputStream` of HADOOP-13560 uploading written data
as parts of a multipart PUT once the threshold set in the configuration
parameter `fs.s3a.multipart.size` (default: 100MB) is reached.

-[S3Guard](./s3guard.html) adds an option of consistent view of the filesystem
+[S3Guard](./s3guard.html) added an option of consistent view of the filesystem
to all processes using the shared DynamoDB table as the authoritative store of
-metadata. Some S3-compatible object stores are fully consistent; the
-proposed algorithm is designed to work with such object stores without the
-need for any DynamoDB tables.
+metadata.
+The proposed algorithm was designed to work with any fully consistent
+object store without the need for DynamoDB tables.
+Since AWS S3 became consistent in 2020, the committers
+now work directly with the store.

## Related work: Spark's `DirectOutputCommitter`

@@ -1246,8 +1271,8 @@ for parallel committing of work, including all the error handling based on
the Netflix experience.

It differs in that it directly streams data to S3 (there is no staging),
-and it also stores the lists of pending commits in S3 too. That mandates
-consistent metadata on S3, which S3Guard provides.
+and it also stores the lists of pending commits in S3. It
+requires a consistent S3 store.


### Core concept: A new/modified output stream for delayed PUT commits
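The core concept named in that heading — an output stream which uploads written data as multipart parts but never makes the object visible on `close()` — can be sketched as follows. This is a simplified, hypothetical model, not the real `S3ABlockOutputStream`: it buffers bytes into fixed-size blocks (standing in for `fs.s3a.multipart.size`) and, on close, returns a pending-commit record instead of completing the upload.

```python
# Simplified sketch of a delayed-PUT output stream: full blocks are
# "uploaded" as multipart parts as they fill, and close() yields the
# record a committer would later use to complete (commit) the upload.

class DelayedCommitStream:
    def __init__(self, path, block_size):
        self.path = path
        self.block_size = block_size  # stands in for fs.s3a.multipart.size
        self.buffer = b""
        self.parts = []               # parts already uploaded to the store

    def write(self, data):
        self.buffer += data
        # Upload a part each time a full block of data has been written.
        while len(self.buffer) >= self.block_size:
            self.parts.append(self.buffer[:self.block_size])
            self.buffer = self.buffer[self.block_size:]

    def close(self):
        if self.buffer:               # final, possibly short, part
            self.parts.append(self.buffer)
            self.buffer = b""
        # The upload is NOT completed here; the committer later issues the
        # final POST, which is what makes the object visible.
        return {"path": self.path, "parts": len(self.parts)}


stream = DelayedCommitStream("dest/part-0000", block_size=4)
stream.write(b"abcdefghij")
pending = stream.close()
assert pending == {"path": "dest/part-0000", "parts": 3}
```

In the real committers, that pending record is what gets persisted (to HDFS for the staging committer, to S3 under `__magic` for the magic committer) until job commit.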
@@ -1480,7 +1505,7 @@ The time to commit a job will be `O(files/threads)`
Every `.pendingset` file in the job attempt directory must be loaded, and a PUT
request issued for every incomplete upload listed in the files.

-Note that it is the bulk listing of all children which is where full consistency
+[Obsolete] Note that it is the bulk listing of all children which is where full consistency
is required. If instead, the list of files to commit could be returned from
tasks to the job committer, as the Spark commit protocol allows, it would be
possible to commit data to an inconsistent object store.
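The `O(files/threads)` cost model above can be illustrated with a small sketch (hypothetical helper names, not the committer's actual code): every pending upload listed across all `.pendingset` files gets one completion request, spread across a pool of threads.

```python
# Illustrative cost model of job commit: one completion request per
# incomplete upload, parallelized over a thread pool, so commit time
# is O(files / threads).

from concurrent.futures import ThreadPoolExecutor

def commit_job(pendingsets, complete_upload, threads=8):
    """Complete every pending upload listed across all .pendingset files."""
    uploads = [u for pendingset in pendingsets for u in pendingset]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Issue the completion requests concurrently; block until all done.
        list(pool.map(complete_upload, uploads))
    return len(uploads)


completed = []
pendingsets = [["task1-upload1", "task1-upload2"], ["task2-upload1"]]
count = commit_job(pendingsets, completed.append, threads=2)
assert count == 3
assert sorted(completed) == ["task1-upload1", "task1-upload2", "task2-upload1"]
```

In practice `complete_upload` would be the POST that completes a multipart upload; here a list append stands in for it.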
@@ -1525,7 +1550,7 @@ commit algorithms.
1. It is possible to create more than one client writing to the
same destination file within the same S3A client/task, either sequentially or in parallel.

-1. Even with a consistent metadata store, if a job overwrites existing
+1. [Obsolete] Even with a consistent metadata store, if a job overwrites existing
files, then old data may still be visible to clients reading the data, until
the update has propagated to all replicas of the data.

@@ -1538,7 +1563,7 @@ all files in the destination directory which were not being overwritten.
for any purpose other than for the storage of pending commit data.

1. Unless extra code is added to every FS operation, it will still be possible
-to manipulate files under the `__magic` tree. That's not bad, it just potentially
+to manipulate files under the `__magic` tree. That's not bad, just potentially
confusing.

1. As written data is not materialized until the commit, it will not be possible
@@ -1693,14 +1718,6 @@ base for relative paths created underneath it.

The committers can only be tested against an S3-compatible object store.

-Although a consistent object store is a requirement for a production deployment
-of the magic committer an inconsistent one has appeared to work during testing, simply by
-adding some delays to the operations: a task commit does not succeed until
-all the objects which it has PUT are visible in the LIST operation. Assuming
-that further listings from the same process also show the objects, the job
-committer will be able to list and commit the uploads.
-
-
The committers have some unit tests, and integration tests based on
the protocol integration test lifted from `org.apache.hadoop.mapreduce.lib.output.TestFileOutputCommitter`
to test various state transitions of the commit mechanism has been extended
@@ -1766,7 +1783,8 @@ tree.
Alternatively, the fact that Spark tasks provide data to the job committer on their
completion means that a list of pending PUT commands could be built up, with the commit
operations being executed by an S3A-specific implementation of the `FileCommitProtocol`.
-As noted earlier, this may permit the requirement for a consistent list operation
+
+[Obsolete] As noted earlier, this may permit the requirement for a consistent list operation
to be bypassed. It would still be important to list what was being written, as
it is needed to aid aborting work in failed tasks, but the list of files
created by successful tasks could be passed directly from the task to committer,
@@ -1890,9 +1908,6 @@ bandwidth and the data upload bandwidth.

No use is made of the cluster filesystem; there are no risks there.

-A consistent store is required, which, for Amazon's infrastructure, means S3Guard.
-This is covered below.
-
A malicious user with write access to the `__magic` directory could manipulate
or delete the metadata of pending uploads, or potentially inject new work into
the commit. Having access to the `__magic` directory implies write access
@@ -1900,13 +1915,12 @@ to the parent destination directory: a malicious user could just as easily
manipulate the final output, without needing to attack the committer's intermediate
files.

-
### Security Risks of all committers


#### Visibility

-* If S3Guard is used for storing metadata, then the metadata is visible to
+[Obsolete] If S3Guard is used for storing metadata, then the metadata is visible to
all users with read access. A malicious user with write access could delete
entries of newly generated files, so they would not be visible.

@@ -1941,7 +1955,7 @@ any of the text fields, script which could then be executed in some XSS
attack. We may wish to consider sanitizing this data on load.

* Paths in tampered data could be modified in an attempt to commit an upload across
-an existing file, or the MPU ID alterated to prematurely commit a different upload.
+an existing file, or the MPU ID altered to prematurely commit a different upload.
These attempts will not succeed, because the destination
path of the upload is declared on the initial POST to initiate the MPU, and
operations associated with the MPU must also declare the path: if the path and