|
@@ -15,14 +15,16 @@
|
|
|
|
|
|
# The Manifest Committer for Azure and Google Cloud Storage
|
|
|
|
|
|
-This document how to use the _Manifest Committer_.
|
|
|
+<!-- MACRO{toc|fromDepth=0|toDepth=2} -->
|
|
|
+
|
|
|
+This document describes how to use the _Manifest Committer_.
|
|
|
|
|
|
The _Manifest_ committer is a committer for work which provides
|
|
|
performance on ABFS for "real world" queries,
|
|
|
and performance and correctness on GCS.
|
|
|
It also works with other filesystems, including HDFS.
|
|
|
However, the design is optimized for object stores where
|
|
|
-listing operatons are slow and expensive.
|
|
|
+listing operations are slow and expensive.
|
|
|
|
|
|
The architecture and implementation of the committer is covered in
|
|
|
[Manifest Committer Architecture](manifest_committer_architecture.html).
|
|
@@ -31,10 +33,16 @@ The architecture and implementation of the committer is covered in
|
|
|
The protocol and its correctness are covered in
|
|
|
[Manifest Committer Protocol](manifest_committer_protocol.html).
|
|
|
|
|
|
-It was added in March 2022, and should be considered unstable
|
|
|
-in early releases.
|
|
|
+It was added in March 2022.
|
|
|
+As of April 2024, the problems which have surfaced are:
|
|
|
+* Memory use at scale.
|
|
|
+* Directory deletion scalability.
|
|
|
+* Resilience of task commit to rename failures.
|
|
|
|
|
|
-<!-- MACRO{toc|fromDepth=0|toDepth=2} -->
|
|
|
+That is: the core algorithm is correct, but task commit
|
|
|
+robustness was insufficient against some failure conditions.
|
|
|
+And scale is always a challenge, even with components tested through
|
|
|
+large TPC-DS test runs.
|
|
|
|
|
|
## Problem:
|
|
|
|
|
@@ -70,10 +78,13 @@ This committer uses the extension point which came in for the S3A committers.
|
|
|
Users can declare a new committer factory for abfs:// and gcs:// URLs.
|
|
|
A suitably configured spark deployment will pick up the new committer.
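+
+As a sketch only (the binding sections below give the full details), declaring
+the factories for both schemes in `core-site.xml` could look like the
+following, using the factory classnames shown later in this document:
+
+```xml
+<property>
+  <name>mapreduce.outputcommitter.factory.scheme.abfs</name>
+  <value>org.apache.hadoop.fs.azurebfs.commit.AzureManifestCommitterFactory</value>
+</property>
+<property>
+  <name>mapreduce.outputcommitter.factory.scheme.gs</name>
+  <value>org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory</value>
+</property>
+```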
|
|
|
|
|
|
-Directory performance issues in job cleanup can be addressed by two options
|
|
|
+Directory performance issues in job cleanup can be addressed by several options, shown in the configuration sketch after this list:
|
|
|
1. The committer will parallelize deletion of task attempt directories before
|
|
|
deleting the `_temporary` directory.
|
|
|
-1. Cleanup can be disabled. .
|
|
|
+2. An initial attempt to delete the `_temporary` directory can be made before
|
|
|
+   the parallel deletion of task attempt directories.
|
|
|
+3. Exceptions can be suppressed, so that cleanup failures do not fail the job.
|
|
|
+4. Cleanup can be disabled.
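+
+A minimal sketch of these cleanup controls as Hadoop configuration, using the
+option names documented in the Cleanup section below; the values shown are
+simply the defaults, not recommendations:
+
+```xml
+<property>
+  <name>mapreduce.manifest.committer.cleanup.parallel.delete</name>
+  <value>true</value>
+</property>
+<property>
+  <name>mapreduce.manifest.committer.cleanup.parallel.delete.base.first</name>
+  <value>false</value>
+</property>
+<property>
+  <name>mapreduce.fileoutputcommitter.cleanup-failures.ignored</name>
+  <value>false</value>
+</property>
+<property>
+  <name>mapreduce.fileoutputcommitter.cleanup.skipped</name>
+  <value>false</value>
+</property>
+```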
|
|
|
|
|
|
The committer can be used with any filesystem client which has a "real" file rename()
|
|
|
operation.
|
|
@@ -112,8 +123,8 @@ These can be done in `core-site.xml`, if it is not defined in the `mapred-defaul
|
|
|
|
|
|
## Binding to the manifest committer in Spark.
|
|
|
|
|
|
-In Apache Spark, the configuration can be done either with command line options (after the '--conf') or by using the `spark-defaults.conf` file. The following is an example of using `spark-defaults.conf` also including the configuration for Parquet with a subclass of the parquet
|
|
|
-committer which uses the factory mechansim internally.
|
|
|
+In Apache Spark, the configuration can be done either with command line options (after the `--conf`) or by using the `spark-defaults.conf` file.
|
|
|
+The following is an example of using `spark-defaults.conf`, also including the configuration for Parquet with a subclass of the Parquet committer which uses the factory mechanism internally.
|
|
|
|
|
|
```
|
|
|
spark.hadoop.mapreduce.outputcommitter.factory.scheme.abfs org.apache.hadoop.fs.azurebfs.commit.AzureManifestCommitterFactory
|
|
@@ -184,6 +195,7 @@ Here are the main configuration options of the committer.
|
|
|
| `mapreduce.manifest.committer.io.threads` | Thread count for parallel operations | `64` |
|
|
|
| `mapreduce.manifest.committer.summary.report.directory` | directory to save reports. | `""` |
|
|
|
| `mapreduce.manifest.committer.cleanup.parallel.delete` | Delete temporary directories in parallel | `true` |
|
|
|
+| `mapreduce.manifest.committer.cleanup.parallel.delete.base.first` | Attempt to delete the base directory before the parallel deletion of task attempt directories | `false` |
|
|
|
| `mapreduce.fileoutputcommitter.cleanup.skipped` | Skip cleanup of `_temporary` directory| `false` |
|
|
|
| `mapreduce.fileoutputcommitter.cleanup-failures.ignored` | Ignore errors during cleanup | `false` |
|
|
|
| `mapreduce.fileoutputcommitter.marksuccessfuljobs` | Create a `_SUCCESS` marker file on successful completion. (and delete any existing one in job setup) | `true` |
|
|
@@ -238,37 +250,6 @@ Caveats
|
|
|
are made against the store. The rate throttling option
|
|
|
`mapreduce.manifest.committer.io.rate` can help avoid this.
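+
+As an illustrative sketch, the rate limit could be lowered in `core-site.xml`;
+the value here is arbitrary and should be tuned to the capacity of the store:
+
+```xml
+<property>
+  <name>mapreduce.manifest.committer.io.rate</name>
+  <value>500</value>
+</property>
+```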
|
|
|
|
|
|
-
|
|
|
-### `mapreduce.manifest.committer.writer.queue.capacity`
|
|
|
-
|
|
|
-This is a secondary scale option.
|
|
|
-It controls the size of the queue for storing lists of files to rename from
|
|
|
-the manifests loaded from the target filesystem, manifests loaded
|
|
|
-from a pool of worker threads, and the single thread which saves
|
|
|
-the entries from each manifest to an intermediate file in the local filesystem.
|
|
|
-
|
|
|
-Once the queue is full, all manifest loading threads will block.
|
|
|
-
|
|
|
-```xml
|
|
|
-<property>
|
|
|
- <name>mapreduce.manifest.committer.writer.queue.capacity</name>
|
|
|
- <value>32</value>
|
|
|
-</property>
|
|
|
-```
|
|
|
-
|
|
|
-As the local filesystem is usually much faster to write to than any cloud store,
|
|
|
-this queue size should not be a limit on manifest load performance.
|
|
|
-
|
|
|
-It can help limit the amount of memory consumed during manifest load during
|
|
|
-job commit.
|
|
|
-The maximum number of loaded manifests will be:
|
|
|
-
|
|
|
-```
|
|
|
-mapreduce.manifest.committer.writer.queue.capacity + mapreduce.manifest.committer.io.threads
|
|
|
-```
|
|
|
-
|
|
|
-
|
|
|
-
|
|
|
## <a name="deleting"></a> Optional: deleting target files in Job Commit
|
|
|
|
|
|
The classic `FileOutputCommitter` deletes files at the destination paths
|
|
@@ -403,6 +384,153 @@ hadoop org.apache.hadoop.mapreduce.lib.output.committer.manifest.files.ManifestP
|
|
|
This works for the files saved at the base of an output directory, and
|
|
|
any reports saved to a report directory.
|
|
|
|
|
|
+Example output from a run of the `ITestAbfsTerasort` MapReduce terasort:
|
|
|
+
|
|
|
+```
|
|
|
+bin/mapred successfile abfs://testing@ukwest.dfs.core.windows.net/terasort/_SUCCESS
|
|
|
+
|
|
|
+Manifest file: abfs://testing@ukwest.dfs.core.windows.net/terasort/_SUCCESS
|
|
|
+succeeded: true
|
|
|
+created: 2024-04-18T18:34:34.003+01:00[Europe/London]
|
|
|
+committer: org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitter
|
|
|
+hostname: pi5
|
|
|
+jobId: job_1713461587013_0003
|
|
|
+jobIdSource: JobID
|
|
|
+Diagnostics
|
|
|
+ mapreduce.manifest.committer.io.threads = 192
|
|
|
+ principal = alice
|
|
|
+ stage = committer_commit_job
|
|
|
+
|
|
|
+Statistics:
|
|
|
+counters=((commit_file_rename=1)
|
|
|
+(committer_bytes_committed=21)
|
|
|
+(committer_commit_job=1)
|
|
|
+(committer_files_committed=1)
|
|
|
+(committer_task_directory_depth=2)
|
|
|
+(committer_task_file_count=2)
|
|
|
+(committer_task_file_size=21)
|
|
|
+(committer_task_manifest_file_size=37157)
|
|
|
+(job_stage_cleanup=1)
|
|
|
+(job_stage_create_target_dirs=1)
|
|
|
+(job_stage_load_manifests=1)
|
|
|
+(job_stage_optional_validate_output=1)
|
|
|
+(job_stage_rename_files=1)
|
|
|
+(job_stage_save_success_marker=1)
|
|
|
+(job_stage_setup=1)
|
|
|
+(op_create_directories=1)
|
|
|
+(op_delete=3)
|
|
|
+(op_delete_dir=1)
|
|
|
+(op_get_file_status=9)
|
|
|
+(op_get_file_status.failures=6)
|
|
|
+(op_list_status=3)
|
|
|
+(op_load_all_manifests=1)
|
|
|
+(op_load_manifest=2)
|
|
|
+(op_mkdirs=4)
|
|
|
+(op_msync=1)
|
|
|
+(op_rename=2)
|
|
|
+(op_rename.failures=1)
|
|
|
+(task_stage_commit=2)
|
|
|
+(task_stage_save_task_manifest=1)
|
|
|
+(task_stage_scan_directory=2)
|
|
|
+(task_stage_setup=2));
|
|
|
+
|
|
|
+gauges=();
|
|
|
+
|
|
|
+minimums=((commit_file_rename.min=141)
|
|
|
+(committer_commit_job.min=2306)
|
|
|
+(committer_task_directory_count=0)
|
|
|
+(committer_task_directory_depth=1)
|
|
|
+(committer_task_file_count=0)
|
|
|
+(committer_task_file_size=0)
|
|
|
+(committer_task_manifest_file_size=18402)
|
|
|
+(job_stage_cleanup.min=196)
|
|
|
+(job_stage_create_target_dirs.min=2)
|
|
|
+(job_stage_load_manifests.min=687)
|
|
|
+(job_stage_optional_validate_output.min=66)
|
|
|
+(job_stage_rename_files.min=161)
|
|
|
+(job_stage_save_success_marker.min=653)
|
|
|
+(job_stage_setup.min=571)
|
|
|
+(op_create_directories.min=1)
|
|
|
+(op_delete.min=57)
|
|
|
+(op_delete_dir.min=129)
|
|
|
+(op_get_file_status.failures.min=57)
|
|
|
+(op_get_file_status.min=55)
|
|
|
+(op_list_status.min=202)
|
|
|
+(op_load_all_manifests.min=445)
|
|
|
+(op_load_manifest.min=171)
|
|
|
+(op_mkdirs.min=67)
|
|
|
+(op_msync.min=0)
|
|
|
+(op_rename.failures.min=266)
|
|
|
+(op_rename.min=139)
|
|
|
+(task_stage_commit.min=206)
|
|
|
+(task_stage_save_task_manifest.min=651)
|
|
|
+(task_stage_scan_directory.min=206)
|
|
|
+(task_stage_setup.min=127));
|
|
|
+
|
|
|
+maximums=((commit_file_rename.max=141)
|
|
|
+(committer_commit_job.max=2306)
|
|
|
+(committer_task_directory_count=0)
|
|
|
+(committer_task_directory_depth=1)
|
|
|
+(committer_task_file_count=1)
|
|
|
+(committer_task_file_size=21)
|
|
|
+(committer_task_manifest_file_size=18755)
|
|
|
+(job_stage_cleanup.max=196)
|
|
|
+(job_stage_create_target_dirs.max=2)
|
|
|
+(job_stage_load_manifests.max=687)
|
|
|
+(job_stage_optional_validate_output.max=66)
|
|
|
+(job_stage_rename_files.max=161)
|
|
|
+(job_stage_save_success_marker.max=653)
|
|
|
+(job_stage_setup.max=571)
|
|
|
+(op_create_directories.max=1)
|
|
|
+(op_delete.max=113)
|
|
|
+(op_delete_dir.max=129)
|
|
|
+(op_get_file_status.failures.max=231)
|
|
|
+(op_get_file_status.max=61)
|
|
|
+(op_list_status.max=300)
|
|
|
+(op_load_all_manifests.max=445)
|
|
|
+(op_load_manifest.max=436)
|
|
|
+(op_mkdirs.max=123)
|
|
|
+(op_msync.max=0)
|
|
|
+(op_rename.failures.max=266)
|
|
|
+(op_rename.max=139)
|
|
|
+(task_stage_commit.max=302)
|
|
|
+(task_stage_save_task_manifest.max=651)
|
|
|
+(task_stage_scan_directory.max=302)
|
|
|
+(task_stage_setup.max=157));
|
|
|
+
|
|
|
+means=((commit_file_rename.mean=(samples=1, sum=141, mean=141.0000))
|
|
|
+(committer_commit_job.mean=(samples=1, sum=2306, mean=2306.0000))
|
|
|
+(committer_task_directory_count=(samples=4, sum=0, mean=0.0000))
|
|
|
+(committer_task_directory_depth=(samples=2, sum=2, mean=1.0000))
|
|
|
+(committer_task_file_count=(samples=4, sum=2, mean=0.5000))
|
|
|
+(committer_task_file_size=(samples=2, sum=21, mean=10.5000))
|
|
|
+(committer_task_manifest_file_size=(samples=2, sum=37157, mean=18578.5000))
|
|
|
+(job_stage_cleanup.mean=(samples=1, sum=196, mean=196.0000))
|
|
|
+(job_stage_create_target_dirs.mean=(samples=1, sum=2, mean=2.0000))
|
|
|
+(job_stage_load_manifests.mean=(samples=1, sum=687, mean=687.0000))
|
|
|
+(job_stage_optional_validate_output.mean=(samples=1, sum=66, mean=66.0000))
|
|
|
+(job_stage_rename_files.mean=(samples=1, sum=161, mean=161.0000))
|
|
|
+(job_stage_save_success_marker.mean=(samples=1, sum=653, mean=653.0000))
|
|
|
+(job_stage_setup.mean=(samples=1, sum=571, mean=571.0000))
|
|
|
+(op_create_directories.mean=(samples=1, sum=1, mean=1.0000))
|
|
|
+(op_delete.mean=(samples=3, sum=240, mean=80.0000))
|
|
|
+(op_delete_dir.mean=(samples=1, sum=129, mean=129.0000))
|
|
|
+(op_get_file_status.failures.mean=(samples=6, sum=614, mean=102.3333))
|
|
|
+(op_get_file_status.mean=(samples=3, sum=175, mean=58.3333))
|
|
|
+(op_list_status.mean=(samples=3, sum=671, mean=223.6667))
|
|
|
+(op_load_all_manifests.mean=(samples=1, sum=445, mean=445.0000))
|
|
|
+(op_load_manifest.mean=(samples=2, sum=607, mean=303.5000))
|
|
|
+(op_mkdirs.mean=(samples=4, sum=361, mean=90.2500))
|
|
|
+(op_msync.mean=(samples=1, sum=0, mean=0.0000))
|
|
|
+(op_rename.failures.mean=(samples=1, sum=266, mean=266.0000))
|
|
|
+(op_rename.mean=(samples=1, sum=139, mean=139.0000))
|
|
|
+(task_stage_commit.mean=(samples=2, sum=508, mean=254.0000))
|
|
|
+(task_stage_save_task_manifest.mean=(samples=1, sum=651, mean=651.0000))
|
|
|
+(task_stage_scan_directory.mean=(samples=2, sum=508, mean=254.0000))
|
|
|
+(task_stage_setup.mean=(samples=2, sum=284, mean=142.0000)));
|
|
|
+
|
|
|
+```
|
|
|
+
|
|
|
## <a name="summaries"></a> Collecting Job Summaries `mapreduce.manifest.committer.summary.report.directory`
|
|
|
|
|
|
The committer can be configured to save the `_SUCCESS` summary files to a report directory,
|
|
@@ -431,46 +559,62 @@ This allows for the statistics of jobs to be collected irrespective of their out
|
|
|
saving the `_SUCCESS` marker is enabled, and without problems caused by a chain of queries
|
|
|
overwriting the markers.
|
|
|
|
|
|
+The `mapred successfile` operation can be used to print these reports.
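+
+As a sketch, with a purely illustrative path, the report directory can be set
+cluster-wide and the saved reports later printed with `mapred successfile`:
+
+```xml
+<property>
+  <name>mapreduce.manifest.committer.summary.report.directory</name>
+  <value>hdfs://namenode/reports/manifest-committer</value>
+</property>
+```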
|
|
|
|
|
|
# <a name="cleanup"></a> Cleanup
|
|
|
|
|
|
Job cleanup is convoluted as it is designed to address a number of issues which
|
|
|
may surface in cloud storage.
|
|
|
|
|
|
-* Slow performance for deletion of directories.
|
|
|
-* Timeout when deleting very deep and wide directory trees.
|
|
|
+* Slow performance for deletion of directories (GCS).
|
|
|
+* Timeout when deleting very deep and wide directory trees (Azure).
|
|
|
* General resilience to cleanup issues escalating to job failures.
|
|
|
|
|
|
|
|
|
-| Option | Meaning | Default Value |
|
|
|
-|--------|---------|---------------|
|
|
|
-| `mapreduce.fileoutputcommitter.cleanup.skipped` | Skip cleanup of `_temporary` directory| `false` |
|
|
|
-| `mapreduce.fileoutputcommitter.cleanup-failures.ignored` | Ignore errors during cleanup | `false` |
|
|
|
-| `mapreduce.manifest.committer.cleanup.parallel.delete` | Delete task attempt directories in parallel | `true` |
|
|
|
+| Option | Meaning | Default Value |
|
|
|
+|-------------------------------------------------------------------|--------------------------------------------------------------------|---------------|
|
|
|
+| `mapreduce.fileoutputcommitter.cleanup.skipped` | Skip cleanup of `_temporary` directory | `false` |
|
|
|
+| `mapreduce.fileoutputcommitter.cleanup-failures.ignored` | Ignore errors during cleanup | `false` |
|
|
|
+| `mapreduce.manifest.committer.cleanup.parallel.delete` | Delete task attempt directories in parallel | `true` |
|
|
|
+| `mapreduce.manifest.committer.cleanup.parallel.delete.base.first` | Attempt to delete the base directory before the parallel deletion of task attempt directories | `false` |
|
|
|
|
|
|
The algorithm is:
|
|
|
|
|
|
-```
|
|
|
-if `mapreduce.fileoutputcommitter.cleanup.skipped`:
|
|
|
+```python
|
|
|
+if "mapreduce.fileoutputcommitter.cleanup.skipped":
|
|
|
return
|
|
|
-if `mapreduce.manifest.committer.cleanup.parallel.delete`:
|
|
|
- attempt parallel delete of task directories; catch any exception
|
|
|
-if not `mapreduce.fileoutputcommitter.cleanup.skipped`:
|
|
|
- delete(`_temporary`); catch any exception
|
|
|
-if caught-exception and not `mapreduce.fileoutputcommitter.cleanup-failures.ignored`:
|
|
|
- throw caught-exception
|
|
|
+if "mapreduce.manifest.committer.cleanup.parallel.delete":
|
|
|
+  if "mapreduce.manifest.committer.cleanup.parallel.delete.base.first":
|
|
|
+ if delete("_temporary"):
|
|
|
+ return
|
|
|
+  delete(list("$task-directories")); catch any exception
|
|
|
+if not "mapreduce.fileoutputcommitter.cleanup.skipped":
|
|
|
+ delete("_temporary"); catch any exception
|
|
|
+if caught-exception and not "mapreduce.fileoutputcommitter.cleanup-failures.ignored":
|
|
|
+ raise caught-exception
|
|
|
```
|
|
|
|
|
|
It's a bit complicated, but the goal is to perform a fast/scalable delete and
|
|
|
throw a meaningful exception if that didn't work.
|
|
|
|
|
|
-When working with ABFS and GCS, these settings should normally be left alone.
|
|
|
-If somehow errors surface during cleanup, enabling the option to
|
|
|
-ignore failures will ensure the job still completes.
|
|
|
+For ABFS, set `mapreduce.manifest.committer.cleanup.parallel.delete.base.first` to `true`,
|
|
|
+which should normally result in less network IO and a faster cleanup.
|
|
|
+
|
|
|
+```
|
|
|
+spark.hadoop.mapreduce.manifest.committer.cleanup.parallel.delete.base.first true
|
|
|
+```
|
|
|
+
|
|
|
+For GCS, setting `mapreduce.manifest.committer.cleanup.parallel.delete.base.first`
|
|
|
+to `false` may speed up cleanup.
|
|
|
+
|
|
|
+If somehow errors surface during cleanup, ignoring failures will ensure the job
|
|
|
+is still considered a success:
|
|
|
+`mapreduce.fileoutputcommitter.cleanup-failures.ignored = true`
|
|
|
+
|
|
|
Disabling cleanup even avoids the overhead of cleanup, but
|
|
|
requires a workflow or manual operation to clean up all
|
|
|
-`_temporary` directories on a regular basis.
|
|
|
-
|
|
|
+`_temporary` directories on a regular basis:
|
|
|
+`mapreduce.fileoutputcommitter.cleanup.skipped = true`.
|
|
|
|
|
|
# <a name="abfs"></a> Working with Azure ADLS Gen2 Storage
|
|
|
|
|
@@ -504,9 +648,15 @@ The core set of Azure-optimized options becomes
|
|
|
</property>
|
|
|
|
|
|
<property>
|
|
|
- <name>spark.hadoop.fs.azure.io.rate.limit</name>
|
|
|
- <value>10000</value>
|
|
|
+ <name>fs.azure.io.rate.limit</name>
|
|
|
+ <value>1000</value>
|
|
|
+</property>
|
|
|
+
|
|
|
+<property>
|
|
|
+ <name>mapreduce.manifest.committer.cleanup.parallel.delete.base.first</name>
|
|
|
+ <value>true</value>
|
|
|
</property>
|
|
|
+
|
|
|
```
|
|
|
|
|
|
And optional settings for debugging/performance analysis
|
|
@@ -514,7 +664,7 @@ And optional settings for debugging/performance analysis
|
|
|
```xml
|
|
|
<property>
|
|
|
<name>mapreduce.manifest.committer.summary.report.directory</name>
|
|
|
- <value>abfs:// Path within same store/separate store</value>
|
|
|
+ <value>Path within same store/separate store</value>
|
|
|
<description>Optional: path to where job summaries are saved</description>
|
|
|
</property>
|
|
|
```
|
|
@@ -523,14 +673,15 @@ And optional settings for debugging/performance analysis
|
|
|
|
|
|
```
|
|
|
spark.hadoop.mapreduce.outputcommitter.factory.scheme.abfs org.apache.hadoop.fs.azurebfs.commit.AzureManifestCommitterFactory
|
|
|
-spark.hadoop.fs.azure.io.rate.limit 10000
|
|
|
+spark.hadoop.fs.azure.io.rate.limit 1000
|
|
|
+spark.hadoop.mapreduce.manifest.committer.cleanup.parallel.delete.base.first true
|
|
|
spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
|
|
|
spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
|
|
|
|
|
|
spark.hadoop.mapreduce.manifest.committer.summary.report.directory (optional: URI of a directory for job summaries)
|
|
|
```
|
|
|
|
|
|
-## Experimental: ABFS Rename Rate Limiting `fs.azure.io.rate.limit`
|
|
|
+## <a name="abfs-rate-limit"></a> ABFS Rename Rate Limiting `fs.azure.io.rate.limit`
|
|
|
|
|
|
To avoid triggering store throttling and backoff delays, as well as other
|
|
|
throttling-related failure conditions, file renames during job commit
|
|
@@ -544,13 +695,12 @@ may issue.
|
|
|
|
|
|
Set the option to `0` to remove all rate limiting.
|
|
|
|
|
|
-The default value of this is set to 10000, which is the default IO capacity for
|
|
|
-an ADLS storage account.
|
|
|
+The default value of this option is 1000.
|
|
|
|
|
|
```xml
|
|
|
<property>
|
|
|
<name>fs.azure.io.rate.limit</name>
|
|
|
- <value>10000</value>
|
|
|
+ <value>1000</value>
|
|
|
<description>maximum number of renames attempted per second</description>
|
|
|
</property>
|
|
|
```
|
|
@@ -569,7 +719,7 @@ If server-side throttling took place, signs of this can be seen in
|
|
|
* The store service's logs and their throttling status codes (usually 503 or 500).
|
|
|
* The job statistic `commit_file_rename_recovered`. This statistic indicates that
|
|
|
ADLS throttling manifested as failures in renames, failures which were recovered
|
|
|
- from in the comitter.
|
|
|
+ from in the committer.
|
|
|
|
|
|
If these are seen, or other applications running at the same time experience
|
|
|
throttling/throttling-triggered problems, consider reducing the value of
|
|
@@ -598,13 +748,14 @@ The Spark settings to switch to this committer are
|
|
|
spark.hadoop.mapreduce.outputcommitter.factory.scheme.gs org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory
|
|
|
spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
|
|
|
spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
|
|
|
-
|
|
|
+spark.hadoop.mapreduce.manifest.committer.cleanup.parallel.delete.base.first false
|
|
|
spark.hadoop.mapreduce.manifest.committer.summary.report.directory (optional: URI of a directory for job summaries)
|
|
|
```
|
|
|
|
|
|
The store's directory delete operations are `O(files)` so the value
|
|
|
of `mapreduce.manifest.committer.cleanup.parallel.delete`
|
|
|
-SHOULD be left at the default of `true`.
|
|
|
+SHOULD be left at the default of `true`, but
|
|
|
+`mapreduce.manifest.committer.cleanup.parallel.delete.base.first` SHOULD be set to `false`.
|
|
|
|
|
|
For MapReduce, declare the binding in `core-site.xml` or `mapred-site.xml`:
|
|
|
```xml
|
|
@@ -639,19 +790,33 @@ spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOut
|
|
|
spark.hadoop.mapreduce.manifest.committer.summary.report.directory (optional: URI of a directory for job summaries)
|
|
|
```
|
|
|
|
|
|
-# <a name="advanced"></a> Advanced Topics
|
|
|
-
|
|
|
-## Advanced Configuration options
|
|
|
+# <a name="advanced"></a> Advanced Configuration options
|
|
|
|
|
|
There are some advanced options which are intended for development and testing,
|
|
|
rather than production use.
|
|
|
|
|
|
-| Option | Meaning | Default Value |
|
|
|
-|--------|----------------------------------------------|---------------|
|
|
|
-| `mapreduce.manifest.committer.store.operations.classname` | Classname for Manifest Store Operations | `""` |
|
|
|
-| `mapreduce.manifest.committer.validate.output` | Perform output validation? | `false` |
|
|
|
-| `mapreduce.manifest.committer.writer.queue.capacity` | Queue capacity for writing intermediate file | `32` |
|
|
|
+| Option | Meaning | Default Value |
|
|
|
+|-----------------------------------------------------------|-------------------------------------------------------------|---------------|
|
|
|
+| `mapreduce.manifest.committer.manifest.save.attempts` | How many attempts should be made to commit a task manifest? | `5` |
|
|
|
+| `mapreduce.manifest.committer.store.operations.classname` | Classname for Manifest Store Operations | `""` |
|
|
|
+| `mapreduce.manifest.committer.validate.output` | Perform output validation? | `false` |
|
|
|
+| `mapreduce.manifest.committer.writer.queue.capacity` | Queue capacity for writing intermediate file | `32` |
|
|
|
+
|
|
|
+### `mapreduce.manifest.committer.manifest.save.attempts`
|
|
|
+
|
|
|
+The number of attempts which should be made to save a task attempt manifest; this is done by:
|
|
|
+1. Writing the file to a temporary file in the job attempt directory.
|
|
|
+2. Deleting any existing task manifest.
|
|
|
+3. Renaming the temporary file to the final filename.
|
|
|
|
|
|
+This may fail for unrecoverable reasons (permissions, permanent loss of network, service down,...) or it may be
|
|
|
+a transient problem which may not reoccur if another attempt is made to write the data.
|
|
|
+
|
|
|
+The number of attempts to make is set by `mapreduce.manifest.committer.manifest.save.attempts`;
|
|
|
+the sleep time increases with each attempt.
|
|
|
+
|
|
|
+Consider increasing the default value if task attempts fail to commit their work
|
|
|
+and fail to recover from network problems.
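+
+A sketch of raising the retry count for deployments where task manifest saves
+fail on transient store errors; the value of 10 is purely illustrative:
+
+```xml
+<property>
+  <name>mapreduce.manifest.committer.manifest.save.attempts</name>
+  <value>10</value>
+</property>
+```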
|
|
|
|
|
|
### Validating output `mapreduce.manifest.committer.validate.output`
|
|
|
|
|
@@ -691,6 +856,34 @@ There is no need to alter these values, except when writing new implementations
|
|
|
something which is only needed if the store provides extra integration support for the
|
|
|
committer.
|
|
|
|
|
|
+### `mapreduce.manifest.committer.writer.queue.capacity`
|
|
|
+
|
|
|
+This is a secondary scale option.
|
|
|
+It controls the size of the queue used to pass lists of files to rename
|
|
|
+between the pool of worker threads which load manifests from the target
|
|
|
+filesystem and the single thread which saves the entries from each manifest
|
|
|
+to an intermediate file in the local filesystem.
|
|
|
+
|
|
|
+Once the queue is full, all manifest loading threads will block.
|
|
|
+
|
|
|
+```xml
|
|
|
+<property>
|
|
|
+ <name>mapreduce.manifest.committer.writer.queue.capacity</name>
|
|
|
+ <value>32</value>
|
|
|
+</property>
|
|
|
+```
|
|
|
+
|
|
|
+As the local filesystem is usually much faster to write to than any cloud store,
|
|
|
+this queue size should not be a limit on manifest load performance.
|
|
|
+
|
|
|
+It can help limit the amount of memory consumed during manifest load during
|
|
|
+job commit.
|
|
|
+The maximum number of loaded manifests will be:
|
|
|
+
|
|
|
+```
|
|
|
+mapreduce.manifest.committer.writer.queue.capacity + mapreduce.manifest.committer.io.threads
|
|
|
+```
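+
+With the default values of 32 for the queue capacity and 64 for
+`mapreduce.manifest.committer.io.threads`, that is a maximum of 96 manifests.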
|
|
|
+
|
|
|
## <a name="concurrent"></a> Support for concurrent jobs to the same directory
|
|
|
|
|
|
It *may* be possible to run multiple jobs targeting the same directory tree.
|