@@ -28,7 +28,7 @@ The standard commit algorithms (the `FileOutputCommitter` and its v1 and v2 algo
rely on directory rename being an `O(1)` atomic operation: callers output their
work to temporary directories in the destination filesystem, then
rename these directories to the final destination as a way of committing work.
-This is the perfect solution for commiting work against any filesystem with
+This is the perfect solution for committing work against any filesystem with
consistent listing operations and where the `FileSystem.rename()` command
is an atomic `O(1)` operation.
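
To make the pattern concrete, here is a minimal sketch in the pseudocode
style used later in this document; `fs` stands in for the usual Hadoop
`FileSystem` abstraction, and the path names are illustrative:

```python
def commitByRename(fs, taskAttemptPath, destPath):
    # work has already been written under taskAttemptPath,
    # a temporary directory in the destination filesystem;
    # a single atomic O(1) rename publishes all of it at once
    fs.rename(taskAttemptPath, destPath)
```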
@@ -60,7 +60,7 @@ delayed completion of multi-part PUT operations
That is: tasks write all data as multipart uploads, *but delay the final
commit action until the final, single job commit action.* Only that
data committed in the job commit action will be made visible; work from speculative
-and failed tasks will not be instiantiated. As there is no rename, there is no
+and failed tasks will not be instantiated. As there is no rename, there is no
delay while data is copied from a temporary directory to the final directory.
The duration of the commit will be the time needed to determine which commit operations
to construct, and to execute them.
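
As a hedged illustration of the delayed-completion trick, here is a boto3
sketch; the S3A committers drive this through the S3A filesystem client, and
the pending-commit record returned here is a stand-in for the real persisted
format, not its actual layout:

```python
import boto3

s3 = boto3.client("s3")

def task_write(bucket, key, blocks):
    """Upload data as a multipart upload, but do NOT complete it."""
    upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
    parts = []
    for number, block in enumerate(blocks, start=1):
        # every part except the last must be at least 5MB
        part = s3.upload_part(Bucket=bucket, Key=key,
                              UploadId=upload["UploadId"],
                              PartNumber=number, Body=block)
        parts.append({"PartNumber": number, "ETag": part["ETag"]})
    # nothing is visible at s3://bucket/key yet; return a pending record
    return {"bucket": bucket, "key": key,
            "upload_id": upload["UploadId"], "parts": parts}

def job_commit(pending_records):
    """The single job commit action: only now do the objects appear."""
    for p in pending_records:
        s3.complete_multipart_upload(
            Bucket=p["bucket"], Key=p["key"], UploadId=p["upload_id"],
            MultipartUpload={"Parts": p["parts"]})
```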
@@ -109,7 +109,7 @@ This is traditionally implemented via a `FileSystem.rename()` call.

It is useful to differentiate between a *task-side commit*: an operation performed
in the task process after its work, and a *driver-side task commit*, in which
- the Job driver perfoms the commit operation. Any task-side commit work will
+ the Job driver performs the commit operation. Any task-side commit work will
be performed across the cluster, and may take place off the critical path for
job execution. However, unless the commit protocol requires all tasks to await
a signal from the job driver, task-side commits cannot instantiate their output
@@ -241,7 +241,7 @@ def commitTask(fs, jobAttemptPath, taskAttemptPath, dest):
  fs.rename(taskAttemptPath, taskCommittedPath)
```

-On a genuine fileystem this is an `O(1)` directory rename.
+On a genuine filesystem this is an `O(1)` directory rename.
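
To illustrate why the mimicked rename discussed next is so expensive, here is
a boto3-style sketch of emulating a directory rename on S3; this is not the
S3A implementation, and a real client would batch the deletes with
`delete_objects`:

```python
import boto3

def mimic_rename(bucket, src_prefix, dest_prefix):
    """Emulate a directory rename on S3: an O(data) copy plus list and delete."""
    s3 = boto3.client("s3")
    for page in s3.get_paginator("list_objects_v2").paginate(
            Bucket=bucket, Prefix=src_prefix):
        for obj in page.get("Contents", []):
            dest_key = dest_prefix + obj["Key"][len(src_prefix):]
            # a server-side copy, but still O(bytes) inside the store
            s3.copy_object(Bucket=bucket, Key=dest_key,
                           CopySource={"Bucket": bucket, "Key": obj["Key"]})
            s3.delete_object(Bucket=bucket, Key=obj["Key"])
```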

On an object store with a mimicked rename, it is `O(data)` for the copy,
along with overhead for listing and deleting all files (for S3, that's
@@ -257,13 +257,13 @@ def abortTask(fs, jobAttemptPath, taskAttemptPath, dest):
  fs.delete(taskAttemptPath, recursive=True)
```

-On a genuine fileystem this is an `O(1)` operation. On an object store,
+On a genuine filesystem this is an `O(1)` operation. On an object store,
proportional to the time to list and delete files, usually in batches.

### Job Commit

-Merge all files/directories in all task commited paths into final destination path.
+Merge all files/directories in all task committed paths into the final destination path.
Optionally, create a 0-byte `_SUCCESS` file in the destination path.

```python
@@ -420,9 +420,9 @@ by renaming the files.
A key difference is that the v1 algorithm commits a source directory
via a directory rename, which is traditionally an `O(1)` operation.

-In constrast, the v2 algorithm lists all direct children of a source directory
+In contrast, the v2 algorithm lists all direct children of a source directory
and recursively calls `mergePath()` on them, ultimately renaming the individual
-files. As such, the number of renames it performa equals the number of source
+files. As such, the number of renames it performs equals the number of source
*files*, rather than the number of source *directories*; the number of directory
listings being `O(depth(src))`, where `depth(path)` is a function returning the
depth of directories under the given path.
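
In the pseudocode style of the earlier algorithms (the `fs` object and the
status fields are hypothetical stand-ins for the Hadoop APIs), the v2 merge
is essentially:

```python
def mergePathsV2(fs, src, dest):
    # a recursive tree walk: one listing per source directory
    for child in fs.listStatus(src):
        target = dest + "/" + child.name
        if child.isDirectory:
            fs.mkdirs(target)
            mergePathsV2(fs, child.path, target)
        else:
            # one rename per source *file*, not per source directory
            fs.rename(child.path, target)
```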
@@ -431,7 +431,7 @@ On a normal filesystem, the v2 merge algorithm is potentially more expensive
than the v1 algorithm. However, as the merging only takes place in task commit,
it is potentially less of a bottleneck in the entire execution process.

-On an objcct store, it is suboptimal not just from its expectation that `rename()`
+On an object store, it is suboptimal not just from its expectation that `rename()`
is an `O(1)` operation, but from its expectation that a recursive tree walk is
an efficient way to enumerate and act on a tree of data. If the algorithm were
switched to using `FileSystem.listFiles(path, recursive)` for a single call to
@@ -548,7 +548,7 @@ the final destination FS, while `file://` can retain the default

### Task Setup

-`Task.initialize()`: read in the configuration, instantate the `JobContextImpl`
+`Task.initialize()`: read in the configuration, instantiate the `JobContextImpl`
and `TaskAttemptContextImpl` instances bound to the current job & task.

### Task Commit
@@ -610,7 +610,7 @@ deleting the previous attempt's data is straightforward. However, for S3 committ
using Multipart Upload as the means of uploading uncommitted data, it is critical
to ensure that pending uploads are always aborted. This can be done by:

-* Making sure that all task-side failure branvches in `Task.done()` call `committer.abortTask()`.
+* Making sure that all task-side failure branches in `Task.done()` call `committer.abortTask()`.
* Having job commit & abort clean up all pending multipart writes to the same directory
tree. That is: require that no other jobs are writing to the same tree, and so
list all pending operations and cancel them, as sketched below.
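
A boto3 sketch of that cleanup, assuming (as required above) that no other
job is writing under the destination prefix:

```python
import boto3

def abort_pending_uploads(bucket, dest_prefix):
    """List and abort every pending multipart upload under the destination tree."""
    s3 = boto3.client("s3")
    for page in s3.get_paginator("list_multipart_uploads").paginate(
            Bucket=bucket, Prefix=dest_prefix):
        for upload in page.get("Uploads", []):
            s3.abort_multipart_upload(Bucket=bucket, Key=upload["Key"],
                                      UploadId=upload["UploadId"])
```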
@@ -653,7 +653,7 @@ rather than relying on fields initiated from the context passed to the construct

#### AM: Job setup: `OutputCommitter.setupJob()`

-This is initated in `org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.StartTransition`.
+This is initiated in `org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.StartTransition`.
It is queued for asynchronous execution in `org.apache.hadoop.mapreduce.v2.app.MRAppMaster.startJobs()`,
which is invoked when the service is started. Thus: the job is set up when the
AM is started.
@@ -686,7 +686,7 @@ the job is considered not to have attempted to commit itself yet.

The presence of `COMMIT_SUCCESS` or `COMMIT_FAIL` is taken as evidence
-that the previous job completed successfully or unsucessfully; the AM
+that the previous job completed successfully or unsuccessfully; the AM
then completes with a success/failure error code, without attempting to rerun
the job.
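
A sketch of that check; `fs` and the marker directory are hypothetical
stand-ins for the AM's actual staging-directory logic:

```python
def previous_commit_outcome(fs, commit_dir):
    # marker files record how far a previous attempt's commit progressed
    if fs.exists(commit_dir + "/COMMIT_SUCCESS"):
        return "success"       # previous attempt committed; report success
    if fs.exists(commit_dir + "/COMMIT_FAIL"):
        return "failure"       # previous attempt failed its commit; report failure
    return "not-attempted"     # no marker: the job has not tried to commit
```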
@@ -871,16 +871,16 @@ base directory. As well as translating the write operation, it also supports
a `getFileStatus()` call on the original path, returning details on the file
at the final destination. This allows for committing applications to verify
the creation/existence/size of the written files (in contrast to the magic
-committer covdered below).
+committer covered below).

The FS targets OpenStack Swift, though other object stores are supportable through
different backends.

This solution is innovative in that it appears to deliver the same semantics
(and hence failure modes) as the Spark Direct OutputCommitter, but
-does not need any changs in either Spark *or* the Hadoop committers. In contrast,
+does not need any change in either Spark *or* the Hadoop committers. In contrast,
the committers proposed here combine changing the Hadoop MR committers for
-ease of pluggability, and offers a new committer exclusivley for S3, one
+ease of pluggability, and offer a new committer exclusively for S3, one
strongly dependent upon and tightly integrated with the S3A Filesystem.

The simplicity of the Stocator committer is something to appreciate.
@@ -922,7 +922,7 @@ The completion operation is apparently `O(1)`; presumably the PUT requests
have already uploaded the data to the server(s) which will eventually be
serving up the data for the final path. All that is needed to complete
the upload is to construct an object by linking together the files in
-the server's local filesystem and udate an entry the index table of the
+the server's local filesystem and update an entry in the index table of the
object store.

In the S3A client, all PUT calls in the sequence and the final commit are
@@ -941,11 +941,11 @@ number of appealing features

-The final point is not to be underestimated, es not even
-a need for a consistency layer.
-* Overall a simpler design.pecially given the need to
-be resilient to the various failure modes which may arise.
+not even a need for a consistency layer.
+* Overall a simpler design.
+
+The final point is not to be underestimated, especially given the need to
+be resilient to the various failure modes which may arise.

-The commiter writes task outputs to a temporary directory on the local FS.
+The committer writes task outputs to a temporary directory on the local FS.
Task outputs are directed to the local FS by `getTaskAttemptPath` and `getWorkPath`.
On task commit, the committer enumerates files in the task attempt directory (ignoring hidden files).
Each file is uploaded to S3 using the [multi-part upload API](http://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html),
@@ -966,7 +966,7 @@ is a local `file://` reference.
within a consistent, cluster-wide filesystem. For Netflix, that is HDFS.
1. The Standard `FileOutputCommitter` (algorithm 1) is used to manage the commit/abort of these
files. That is: it copies only those lists of files to commit from successful tasks
-into a (transient) job commmit directory.
+into a (transient) job commit directory.
1. The S3 job committer reads the pending file list for every task committed
in HDFS, and completes those put requests (see the sketch below).
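
A sketch of the final two steps; `cluster_fs` is a stand-in for an HDFS
client, and the JSON layout of the pending-commit records is illustrative,
not the committer's real file format:

```python
import json
import boto3

def commit_job(cluster_fs, pendingset_dir):
    """Complete every upload recorded by successfully committed tasks."""
    s3 = boto3.client("s3")
    for path in cluster_fs.list(pendingset_dir):
        for commit in json.loads(cluster_fs.read(path))["commits"]:
            s3.complete_multipart_upload(
                Bucket=commit["bucket"], Key=commit["key"],
                UploadId=commit["uploadId"],
                MultipartUpload={"Parts": commit["parts"]})
```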
@@ -1028,7 +1028,7 @@ complete at or near the same time, there may be a peak of bandwidth load
slowing down the upload.

Time to commit will be the same, and, given the Netflix committer has already
-implemented the paralellization logic here, a time of `O(files/threads)`.
+implemented the parallelization logic here, a time of `O(files/threads)`.
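
A generic Python sketch of that parallelization; the actual committer uses a
Java thread pool, so this is only the shape of the logic:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_commit(pending_commits, complete_one, threads=16):
    # each completion is an independent S3 call, so with a pool of
    # `threads` workers the commit takes O(files/threads)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # list() drains the iterator and propagates the first failure
        list(pool.map(complete_one, pending_commits))
```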

### Resilience
@@ -1105,7 +1105,7 @@ This is done by
an abort of all successfully read files.
1. List and abort all pending multipart uploads.

-Because of action #2, action #1 is superflous. It is retained so as to leave
+Because of action #2, action #1 is superfluous. It is retained so as to leave
open the option of making action #2 a configurable option, which would be
required to handle the use case of >1 partitioned commit running simultaneously.
@@ -1115,7 +1115,7 @@ Because the local data is managed with the v1 commit algorithm, the
second attempt of the job will recover all the outstanding commit data
of the first attempt; those tasks will not be rerun.

-This also ensures that on a job abort, the invidual tasks' .pendingset
+This also ensures that on a job abort, the individual tasks' .pendingset
files can be read and used to initiate the abort of those uploads.
That is: a recovered job can clean up the pending writes of the previous job
@@ -1129,7 +1129,7 @@ must be configured to automatically delete the pending request.
Those uploads already executed by a failed job commit will persist; those
yet to execute will remain outstanding.

-The committer currently declares itself as non-recoverble, but that
+The committer currently declares itself as non-recoverable, but that
may not actually hold, as the recovery process could be one of:

1. Enumerate all job commits from the .pendingset files (*:= Commits*).
@@ -1203,7 +1203,7 @@ that of the final job destination. When the job is committed, the pending
writes are instantiated.

With the addition of the Netflix Staging committer, the actual committer
-code now shares common formats for the persistent metadadata and shared routines
+code now shares common formats for the persistent metadata and shared routines
for parallel committing of work, including all the error handling based on
the Netflix experience.
@@ -1333,7 +1333,7 @@ during job and task committer initialization.

The job/task commit protocol is expected to handle this with the task
only committing work when the job driver tells it to. A network partition
-should trigger the task committer's cancellation of the work (this is a protcol
+should trigger the task committer's cancellation of the work (this is a protocol
above the committers).

#### Job Driver failure
@@ -1349,7 +1349,7 @@ when the job driver cleans up it will cancel pending writes under the directory.

#### Multiple jobs targeting the same destination directory

-This leaves things in an inderminate state.
+This leaves things in an indeterminate state.

#### Failure during task commit
@@ -1388,7 +1388,7 @@ Two options present themselves
and test that code as appropriate.

Fixing the calling code does seem to be the best strategy, as it allows the
-failure to be explictly handled in the commit protocol, rather than hidden
+failure to be explicitly handled in the commit protocol, rather than hidden
in the committer.

#### Preemption
@@ -1418,7 +1418,7 @@ with many millions of objects —rather than list all keys searching for those
with `/__magic/**/*.pending` in their name, work backwards from the active uploads to
the directories with the data.

-We may also want to consider having a cleanup operationn in the S3 CLI to
+We may also want to consider having a cleanup operation in the S3 CLI to
do the full tree scan and purge of pending items, and give some statistics on
what was found. This will keep costs down and help us identify problems
related to cleanup.
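
A sketch of what such a CLI entry point might look like; `purge_pending` is a
hypothetical name, not an existing tool:

```python
import boto3

def purge_pending(bucket, prefix, purge=False):
    """Scan for outstanding multipart uploads; report and optionally abort them."""
    s3 = boto3.client("s3")
    found = 0
    for page in s3.get_paginator("list_multipart_uploads").paginate(
            Bucket=bucket, Prefix=prefix):
        for upload in page.get("Uploads", []):
            found += 1
            print(upload["Key"], upload["Initiated"])
            if purge:
                s3.abort_multipart_upload(Bucket=bucket, Key=upload["Key"],
                                          UploadId=upload["UploadId"])
    print("outstanding uploads under %s: %d%s"
          % (prefix, found, " (aborted)" if purge else ""))
```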
@@ -1538,7 +1538,7 @@ The S3A Committer version, would

In order to support the ubiquitous `FileOutputFormat` and subclasses,
S3A Committers will need to somehow be accepted as a valid committer by the class,
-a class which explicity expects the output committer to be `FileOutputCommitter`
+a class which explicitly expects the output committer to be `FileOutputCommitter`:

```java
public Path getDefaultWorkFile(TaskAttemptContext context,
@@ -1555,10 +1555,10 @@ Here are some options which have been considered, explored and discarded

1. Adding more of a factory mechanism to create `FileOutputCommitter` instances;
subclass this for S3A output and return it. The complexity of `FileOutputCommitter`
-and of supporting more dynamic consturction makes this dangerous from an implementation
+and of supporting more dynamic construction makes this dangerous from an implementation
and maintenance perspective.

-1. Add a new commit algorithmm "3", which actually reads in the configured
+1. Add a new commit algorithm "3", which actually reads in the configured
classname of a committer which it then instantiates and then relays the commit
operations, passing in context information. This new committer interface would
add the necessary methods and attributes. This is viable, but does still change
@@ -1695,7 +1695,7 @@ marker implied the classic `FileOutputCommitter` had been used; if it could be r
then it provides some details on the commit operation which are then used
in assertions in the test suite.

-It has since been extended to collet metrics and other values, and has proven
+It has since been extended to collect metrics and other values, and has proven
equally useful in Spark integration testing.

## Integrating the Committers with Apache Spark
@@ -1727,8 +1727,8 @@ tree.

Alternatively, the fact that Spark tasks provide data to the job committer on their
completion means that a list of pending PUT commands could be built up, with the commit
-operations being excuted by an S3A-specific implementation of the `FileCommitProtocol`.
-As noted earlier, this may permit the reqirement for a consistent list operation
+operations being executed by an S3A-specific implementation of the `FileCommitProtocol`.
+As noted earlier, this may permit the requirement for a consistent list operation
to be bypassed. It would still be important to list what was being written, as
it is needed to aid aborting work in failed tasks, but the list of files
created by successful tasks could be passed directly from the task to the committer,
@@ -1833,7 +1833,7 @@ quotas in local FS, keeping temp dirs on different mounted FS from root.

The intermediate `.pendingset` files are saved in HDFS under the directory in
`fs.s3a.committer.staging.tmp.path`, defaulting to `/tmp`. This data can
disclose the workflow (it contains the destination paths & amount of data
-generated), and if deleted, breaks the job. If malicous code were to edit
+generated), and if deleted, breaks the job. If malicious code were to edit
the file, by, for example, reordering the ordered etag list, the generated
data would be committed out of order, creating invalid files. As this is
the (usually transient) cluster FS, any user in the cluster has the potential
@@ -1848,7 +1848,7 @@ The directory defined by `fs.s3a.buffer.dir` is used to buffer blocks
before upload, unless the job is configured to buffer the blocks in memory.
This is as before: no incremental risk. As blocks are deleted from the filesystem
after upload, the amount of storage needed is determined by the data generation
-bandwidth and the data upload bandwdith.
+bandwidth and the data upload bandwidth.

No use is made of the cluster filesystem; there are no risks there.
@@ -1946,6 +1946,6 @@ which will made absolute relative to the current user. In filesystems in
which access under user's home directories is restricted, this final, absolute
path will not be visible to untrusted accounts.

-* Maybe: define the for valid characters in a text strings, and a regext for
+* Maybe: define the set of valid characters for text strings, and a regex for
validating, e.g. `[a-zA-Z0-9 \.\,\(\) \-\+]+`, and then validate any free text
JSON fields on load and save.
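
A sketch of that validation, using the candidate pattern above; the field
names and the exact failure handling are illustrative:

```python
import re

# fullmatch() rejects any string containing characters outside the whitelist
VALID_TEXT = re.compile(r'[a-zA-Z0-9 \.\,\(\) \-\+]+')

def validate_free_text(fields):
    """Validate free-text JSON fields on load and save."""
    for name, value in fields.items():
        if not VALID_TEXT.fullmatch(value):
            raise ValueError("invalid characters in field %s" % name)
```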