<!---
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

# Hadoop Azure Support: Azure Blob Storage

* [Introduction](#Introduction)
* [Features](#Features)
* [Limitations](#Limitations)
* [Usage](#Usage)
    * [Concepts](#Concepts)
    * [Configuring Credentials](#Configuring_Credentials)
    * [Page Blob Support and Configuration](#Page_Blob_Support_and_Configuration)
    * [Atomic Folder Rename](#Atomic_Folder_Rename)
    * [Accessing wasb URLs](#Accessing_wasb_URLs)
* [Testing the hadoop-azure Module](#Testing_the_hadoop-azure_Module)

## <a name="Introduction" />Introduction

The hadoop-azure module provides support for integration with
[Azure Blob Storage](http://azure.microsoft.com/en-us/documentation/services/storage/).
The built jar file, named hadoop-azure.jar, also declares transitive dependencies
on the additional artifacts it requires, notably the
[Azure Storage SDK for Java](https://github.com/Azure/azure-storage-java).

## <a name="Features" />Features

* Read and write data stored in an Azure Blob Storage account.
* Present a hierarchical file system view by implementing the standard Hadoop
  [`FileSystem`](../api/org/apache/hadoop/fs/FileSystem.html) interface.
* Supports configuration of multiple Azure Blob Storage accounts.
* Supports both block blobs (suitable for most use cases, such as MapReduce) and
  page blobs (suitable for continuous write use cases, such as an HBase
  write-ahead log).
* Reference file system paths using URLs with the `wasb` scheme.
* Also reference file system paths using URLs with the `wasbs` scheme for SSL
  encrypted access.
* Can act as a source of data in a MapReduce job, or a sink.
* Tested on both Linux and Windows.
* Tested at scale.

## <a name="Limitations" />Limitations

* The append operation is not implemented.
* File owner and group are persisted, but the permissions model is not enforced.
  Authorization occurs at the level of the entire Azure Blob Storage account.
* File last access time is not tracked.

## <a name="Usage" />Usage

### <a name="Concepts" />Concepts

The Azure Blob Storage data model presents 3 core concepts:

* **Storage Account**: All access is done through a storage account.
* **Container**: A container is a grouping of multiple blobs. A storage account
  may have multiple containers. In Hadoop, an entire file system hierarchy is
  stored in a single container. It is also possible to configure multiple
  containers, effectively presenting multiple file systems that can be referenced
  using distinct URLs.
* **Blob**: A file of any type and size. In Hadoop, files are stored in blobs.
  The internal implementation also uses blobs to persist the file system
  hierarchy and other metadata.

### <a name="Configuring_Credentials" />Configuring Credentials

Usage of Azure Blob Storage requires configuration of credentials. Typically
this is set in core-site.xml. The configuration property name is of the form
`fs.azure.account.key.<account name>.blob.core.windows.net` and the value is the
access key. **The access key is a secret that protects access to your storage
account. Do not share the access key (or the core-site.xml file) with an
untrusted party.**

For example:

    <property>
      <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
      <value>YOUR ACCESS KEY</value>
    </property>

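The same account key can also be supplied programmatically. Below is a minimal
sketch using the standard Hadoop `Configuration` and `FileSystem` APIs; the
account name `youraccount`, the container name `yourcontainer`, and the key
value are placeholders:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class WasbAccessExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Equivalent of the core-site.xml property shown above.
        conf.set("fs.azure.account.key.youraccount.blob.core.windows.net",
            "YOUR ACCESS KEY");
        // Obtain a FileSystem bound to one container in the account.
        FileSystem fs = FileSystem.get(
            URI.create("wasb://yourcontainer@youraccount.blob.core.windows.net/"),
            conf);
        System.out.println("Connected to " + fs.getUri());
      }
    }
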
In many Hadoop clusters, the core-site.xml file is world-readable. If it's
undesirable for the access key to be visible in core-site.xml, then it's also
possible to configure it in encrypted form. An additional configuration property
specifies an external program to be invoked by Hadoop processes to decrypt the
key. The encrypted key value is passed to this external program as a command
line argument, and the decrypted key is read back from the program's standard
output:

    <property>
      <name>fs.azure.account.keyprovider.youraccount</name>
      <value>org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider</value>
    </property>

    <property>
      <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
      <value>YOUR ENCRYPTED ACCESS KEY</value>
    </property>

    <property>
      <name>fs.azure.shellkeyprovider.script</name>
      <value>PATH TO DECRYPTION PROGRAM</value>
    </property>

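The decryption program can be any executable that honors this contract. As a
purely hypothetical illustration, the sketch below assumes the key was
obfuscated with Base64 (a real deployment would substitute genuine decryption);
it receives the encrypted key as its command-line argument and writes the
decrypted key to standard output:

    import java.util.Base64;

    // Hypothetical stand-in for a real decryption program. Contract:
    // args[0] is the encrypted key; the decrypted key goes to stdout.
    public class DecryptAccountKey {
      public static void main(String[] args) {
        String encrypted = args[0];
        // Example only: substitute real decryption here.
        String decrypted = new String(Base64.getDecoder().decode(encrypted));
        System.out.print(decrypted);
      }
    }
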
### <a name="Page_Blob_Support_and_Configuration" />Page Blob Support and Configuration

The Azure Blob Storage interface for Hadoop supports two kinds of blobs,
[block blobs and page blobs](http://msdn.microsoft.com/en-us/library/azure/ee691964.aspx).
Block blobs are the default kind of blob and are good for most big-data use
cases, such as input data for Hive, Pig, and analytical MapReduce jobs. Page
blob handling in hadoop-azure was introduced to support HBase log files. Page
blobs can be written any number of times, whereas block blobs can only be
appended to 50,000 times before you run out of blocks and your writes fail.
That won't work for HBase logs, so page blob support was introduced to
overcome this limitation.

Page blobs can be used for other purposes beyond just HBase log files. Page
blobs can be up to 1TB in size, larger than the maximum 200GB size for block
blobs.

In order to have the files you create be page blobs, you must set the
configuration variable `fs.azure.page.blob.dir` to a comma-separated list of
folder names.

For example:

    <property>
      <name>fs.azure.page.blob.dir</name>
      <value>/hbase/WALs,/hbase/oldWALs,/data/mypageblobfiles</value>
    </property>

You can set this to simply `/` to make all files page blobs.

The configuration option `fs.azure.page.blob.size` is the default initial
size for a page blob. It must be 128MB or greater, and no more than 1TB,
specified as an integer number of bytes.

The configuration option `fs.azure.page.blob.extension.size` is the page blob
extension size. This defines the amount by which to extend a page blob when it
starts to get full. It must be 128MB or greater, specified as an integer number
of bytes.

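All three page blob options can also be set programmatically. A minimal sketch
follows; the property names come from this document, while the folder list and
byte values (512 MB and 256 MB here) are arbitrary examples within the
documented ranges:

    import org.apache.hadoop.conf.Configuration;

    public class PageBlobConfigExample {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Folders whose files should be created as page blobs.
        conf.set("fs.azure.page.blob.dir", "/hbase/WALs,/hbase/oldWALs");
        // Initial page blob size: 512 MB (>= 128 MB and <= 1 TB).
        conf.setLong("fs.azure.page.blob.size", 512L * 1024 * 1024);
        // Extension size: 256 MB (>= 128 MB).
        conf.setLong("fs.azure.page.blob.extension.size", 256L * 1024 * 1024);
      }
    }
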
### <a name="Atomic_Folder_Rename" />Atomic Folder Rename

Azure storage stores files as a flat key/value store without formal support
for folders. The hadoop-azure file system layer simulates folders on top
of Azure storage. By default, folder rename in the hadoop-azure file system
layer is not atomic. That means that a failure during a folder rename
could, for example, leave some folders in the original directory and
some in the new one.

HBase depends on atomic folder rename. Hence, a configuration setting called
`fs.azure.atomic.rename.dir` was introduced that allows you to specify a
comma-separated list of directories to receive special treatment so that
folder rename is made atomic. The default value of this setting is just
`/hbase`. Redo will be applied to finish a folder rename that fails. A file
`<folderName>-renamePending.json` may appear temporarily and is the record of
the intention of the rename operation, to allow redo in event of a failure.

For example:

    <property>
      <name>fs.azure.atomic.rename.dir</name>
      <value>/hbase,/data</value>
    </property>

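No special API is needed from client code; a rename under a covered directory
is the ordinary `FileSystem.rename`. A minimal sketch, assuming `/data` is
listed in `fs.azure.atomic.rename.dir` as above and using placeholder account
and container names:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AtomicRenameExample {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(
            URI.create("wasb://yourcontainer@youraccount.blob.core.windows.net/"),
            new Configuration());
        // Because /data is covered by fs.azure.atomic.rename.dir, a failed
        // rename of this folder will be finished later by redo.
        boolean renamed = fs.rename(new Path("/data/staging"),
            new Path("/data/final"));
        System.out.println("renamed: " + renamed);
      }
    }
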
### <a name="Accessing_wasb_URLs" />Accessing wasb URLs

After credentials are configured in core-site.xml, any Hadoop component may
reference files in that Azure Blob Storage account by using URLs of the following
format:

    wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path>

The schemes `wasb` and `wasbs` identify a URL on a file system backed by Azure
Blob Storage. `wasb` utilizes unencrypted HTTP access for all interaction with
the Azure Blob Storage API. `wasbs` utilizes SSL encrypted HTTPS access.

For example, the following
[FileSystem Shell](../hadoop-project-dist/hadoop-common/FileSystemShell.html)
commands demonstrate access to a storage account named `youraccount` and a
container named `yourcontainer`.

    > hadoop fs -mkdir wasb://yourcontainer@youraccount.blob.core.windows.net/testDir

    > hadoop fs -put testFile wasb://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile

    > hadoop fs -cat wasbs://yourcontainer@youraccount.blob.core.windows.net/testDir/testFile
    test file content

It's also possible to configure `fs.defaultFS` to use a `wasb` or `wasbs` URL.
This causes all bare paths, such as `/testDir/testFile`, to resolve automatically
to that file system.

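The same operations are available from Java through the `FileSystem` API. A
minimal sketch mirroring the shell commands above, again with placeholder
account and container names:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class WasbShellEquivalent {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(
            URI.create("wasb://yourcontainer@youraccount.blob.core.windows.net/"),
            new Configuration());
        fs.mkdirs(new Path("/testDir"));                  // hadoop fs -mkdir
        try (FSDataOutputStream out = fs.create(new Path("/testDir/testFile"))) {
          out.writeBytes("test file content\n");          // hadoop fs -put
        }
        try (FSDataInputStream in = fs.open(new Path("/testDir/testFile"))) {
          IOUtils.copyBytes(in, System.out, 4096, false); // hadoop fs -cat
        }
      }
    }
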
## <a name="Testing_the_hadoop-azure_Module" />Testing the hadoop-azure Module

The hadoop-azure module includes a full suite of unit tests. Most of the tests
will run without additional configuration by running `mvn test`. This includes
tests against mocked storage, which is an in-memory emulation of Azure Storage.

A selection of tests can run against the
[Azure Storage Emulator](http://msdn.microsoft.com/en-us/library/azure/hh403989.aspx)
which is a high-fidelity emulation of live Azure Storage. The emulator is
sufficient for high-confidence testing. The emulator is a Windows executable
that runs on a local machine.

To use the emulator, install Azure SDK 2.3 and start the storage emulator. Then,
edit `src/test/resources/azure-test.xml` and add the following property:

    <property>
      <name>fs.azure.test.emulator</name>
      <value>true</value>
    </property>

There is a known issue when running tests with the emulator. You may see the
following failure message:

    com.microsoft.windowsazure.storage.StorageException: The value for one of the HTTP headers is not in the correct format.

To resolve this, restart the Azure Emulator. Ensure that it is v3.2 or later.

It's also possible to run tests against a live Azure Storage account by adding
credentials to `src/test/resources/azure-test.xml` and setting
`fs.azure.test.account.name` to the name of the storage account.

For example:

    <property>
      <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
      <value>YOUR ACCESS KEY</value>
    </property>

    <property>
      <name>fs.azure.test.account.name</name>
      <value>youraccount</value>
    </property>