
HADOOP-16401. ABFS: port Azure doc to 3.2 branch.

Signed-off-by: Masatake Iwasaki <iwasakims@apache.org>

+ 746 - 33
hadoop-tools/hadoop-azure/src/site/markdown/abfs.md

@@ -16,67 +16,780 @@
 
 <!-- MACRO{toc|fromDepth=1|toDepth=3} -->
 
-## Introduction
+## <a name="introduction"></a> Introduction
 
 The `hadoop-azure` module provides support for the Azure Data Lake Storage Gen2
 storage layer through the "abfs" connector.
 
-To make it part of Apache Hadoop's default classpath, simply make sure that
-`HADOOP_OPTIONAL_TOOLS` in `hadoop-env.sh` has `hadoop-azure` in the list.
+To make it part of Apache Hadoop's default classpath, make sure that the
+`HADOOP_OPTIONAL_TOOLS` environment variable has `hadoop-azure` in the list,
+*on every machine in the cluster*:
 
-## Features
+```bash
+export HADOOP_OPTIONAL_TOOLS=hadoop-azure
+```
 
-* Read and write data stored in an Azure Blob Storage account.
+You can set this locally in your `.profile`/`.bashrc`, but note it won't
+propagate to jobs running in-cluster.
+
+
+## <a name="features"></a> Features of the ABFS connector.
+
+* Supports reading and writing data stored in an Azure Blob Storage account.
 * *Fully Consistent* view of the storage across all clients.
-* Can read data written through the wasb: connector.
-* Present a hierarchical file system view by implementing the standard Hadoop
+* Can read data written through the `wasb:` connector.
+* Presents a hierarchical file system view by implementing the standard Hadoop
   [`FileSystem`](../api/org/apache/hadoop/fs/FileSystem.html) interface.
 * Supports configuration of multiple Azure Blob Storage accounts.
-* Can act as a source or destination of data in Hadoop MapReduce, Apache Hive, Apache Spark
-* Tested at scale on both Linux and Windows.
+* Can act as a source or destination of data in Hadoop MapReduce, Apache Hive, Apache Spark.
+* Tested at scale on both Linux and Windows by Microsoft themselves.
 * Can be used as a replacement for HDFS on Hadoop clusters deployed in Azure infrastructure.
 
+For details on ABFS, consult the following documents:
+
+* [A closer look at Azure Data Lake Storage Gen2](https://azure.microsoft.com/en-gb/blog/a-closer-look-at-azure-data-lake-storage-gen2/);
+MSDN Article from June 28, 2018.
+* [Storage Tiers](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-storage-tiers)
+
+## Getting started
 
+### Concepts
 
-## Limitations
+The Azure Storage data model presents 3 core concepts:
 
-* File last access time is not tracked.
+* **Storage Account**: All access is done through a storage account.
+* **Container**: A container is a grouping of multiple blobs.  A storage account
+  may have multiple containers.  In Hadoop, an entire file system hierarchy is
+  stored in a single container.
+* **Blob**: A file of any type and size, stored as with the existing `wasb:` connector.
+
+The ABFS connector connects to classic containers, or those created
+with Hierarchical Namespaces.
+
+## <a name="namespaces"></a> Hierarchical Namespaces (and WASB Compatibility)
+
+A key aspect of ADLS Gen 2 is its support for
+[hierarchical namespaces](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace).
+These are effectively directories, and they offer high-performance rename and delete
+operations, which significantly improves the performance of query engines writing data
+to the store, including MapReduce, Spark, Hive, as well as DistCp.
+
+This feature is only available if the container was created with "namespace"
+support.
+
+You enable namespace support when creating a new Storage Account,
+either by checking the "Hierarchical Namespace" option in the Portal UI or,
+when creating the account from the command line, by adding the option
+`--hierarchical-namespace true`.
+
+_You cannot enable Hierarchical Namespaces on an existing storage account._
+
+Containers in a storage account with Hierarchical Namespaces are
+not (currently) readable through the `wasb:` connector.
+
+Some of the `az storage` command line commands fail too, for example:
+
+```bash
+$ az storage container list --account-name abfswales1
+Blob API is not yet supported for hierarchical namespace accounts. ErrorCode: BlobApiNotYetSupportedForHierarchicalNamespaceAccounts
+```
+
+### <a name="creating"></a> Creating an Azure Storage Account
+
+The best documentation on getting started with Azure Datalake Gen2 with the
+abfs connector is [Using Azure Data Lake Storage Gen2 with Azure HDInsight clusters](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-use-hdi-cluster)
+
+It includes instructions to create a new storage account from [the Azure command line tool](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest),
+which can be installed on Windows, MacOS (via Homebrew) and Linux (apt or yum).
+
+The [az storage](https://docs.microsoft.com/en-us/cli/azure/storage?view=azure-cli-latest) subcommand
+handles all storage commands, [`az storage account create`](https://docs.microsoft.com/en-us/cli/azure/storage/account?view=azure-cli-latest#az-storage-account-create)
+does the creation.
+
+Until the ADLS gen2 API support is finalized, you need to add an extension
+to the `az` CLI:
+```bash
+az extension add --name storage-preview
+```
+
+Check that all is well by verifying that the usage command includes `--hierarchical-namespace`:
+```
+$  az storage account
+usage: az storage account create [-h] [--verbose] [--debug]
+     [--output {json,jsonc,table,tsv,yaml,none}]
+     [--query JMESPATH] --resource-group
+     RESOURCE_GROUP_NAME --name ACCOUNT_NAME
+     [--sku {Standard_LRS,Standard_GRS,Standard_RAGRS,Standard_ZRS,Premium_LRS,Premium_ZRS}]
+     [--location LOCATION]
+     [--kind {Storage,StorageV2,BlobStorage,FileStorage,BlockBlobStorage}]
+     [--tags [TAGS [TAGS ...]]]
+     [--custom-domain CUSTOM_DOMAIN]
+     [--encryption-services {blob,file,table,queue} [{blob,file,table,queue} ...]]
+     [--access-tier {Hot,Cool}]
+     [--https-only [{true,false}]]
+     [--file-aad [{true,false}]]
+     [--hierarchical-namespace [{true,false}]]
+     [--bypass {None,Logging,Metrics,AzureServices} [{None,Logging,Metrics,AzureServices} ...]]
+     [--default-action {Allow,Deny}]
+     [--assign-identity]
+     [--subscription _SUBSCRIPTION]
+```
+
+You can list locations from `az account list-locations`, which lists the
+name to refer to in the `--location` argument:
+```
+$ az account list-locations -o table
+
+DisplayName          Latitude    Longitude    Name
+-------------------  ----------  -----------  ------------------
+East Asia            22.267      114.188      eastasia
+Southeast Asia       1.283       103.833      southeastasia
+Central US           41.5908     -93.6208     centralus
+East US              37.3719     -79.8164     eastus
+East US 2            36.6681     -78.3889     eastus2
+West US              37.783      -122.417     westus
+North Central US     41.8819     -87.6278     northcentralus
+South Central US     29.4167     -98.5        southcentralus
+North Europe         53.3478     -6.2597      northeurope
+West Europe          52.3667     4.9          westeurope
+Japan West           34.6939     135.5022     japanwest
+Japan East           35.68       139.77       japaneast
+Brazil South         -23.55      -46.633      brazilsouth
+Australia East       -33.86      151.2094     australiaeast
+Australia Southeast  -37.8136    144.9631     australiasoutheast
+South India          12.9822     80.1636      southindia
+Central India        18.5822     73.9197      centralindia
+West India           19.088      72.868       westindia
+Canada Central       43.653      -79.383      canadacentral
+Canada East          46.817      -71.217      canadaeast
+UK South             50.941      -0.799       uksouth
+UK West              53.427      -3.084       ukwest
+West Central US      40.890      -110.234     westcentralus
+West US 2            47.233      -119.852     westus2
+Korea Central        37.5665     126.9780     koreacentral
+Korea South          35.1796     129.0756     koreasouth
+France Central       46.3772     2.3730       francecentral
+France South         43.8345     2.1972       francesouth
+Australia Central    -35.3075    149.1244     australiacentral
+Australia Central 2  -35.3075    149.1244     australiacentral2
+```
+
+Once a location has been chosen, create the account:
+```bash
+
+az storage account create --verbose \
+    --name abfswales1 \
+    --resource-group devteam2 \
+    --kind StorageV2 \
+    --hierarchical-namespace true \
+    --location ukwest \
+    --sku Standard_LRS \
+    --https-only true \
+    --encryption-services blob \
+    --access-tier Hot \
+    --tags owner=engineering \
+    --assign-identity \
+    --output jsonc
+```
+
+The output of the command is a JSON document whose `primaryEndpoints` field
+includes the name of the store endpoint:
+```json
+{
+  "primaryEndpoints": {
+    "blob": "https://abfswales1.blob.core.windows.net/",
+    "dfs": "https://abfswales1.dfs.core.windows.net/",
+    "file": "https://abfswales1.file.core.windows.net/",
+    "queue": "https://abfswales1.queue.core.windows.net/",
+    "table": "https://abfswales1.table.core.windows.net/",
+    "web": "https://abfswales1.z35.web.core.windows.net/"
+  }
+}
+```
+
+The `abfswales1.dfs.core.windows.net` account is the name by which the
+storage account will be referred to.
+
+Now ask for the connection string to the store, which contains the account key:
+```bash
+az storage account  show-connection-string --name abfswales1
+{
+  "connectionString": "DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net;AccountName=abfswales1;AccountKey=ZGlkIHlvdSByZWFsbHkgdGhpbmsgSSB3YXMgZ29pbmcgdG8gcHV0IGEga2V5IGluIGhlcmU/IA=="
+}
+```
+
+You then need to add the access key to your `core-site.xml`, a JCEKS file, or
+use your cluster management tool to set the option `fs.azure.account.key.STORAGE-ACCOUNT`
+to this value.
+```XML
+<property>
+  <name>fs.azure.account.key.abfswales1.dfs.core.windows.net</name>
+  <value>ZGlkIHlvdSByZWFsbHkgdGhpbmsgSSB3YXMgZ29pbmcgdG8gcHV0IGEga2V5IGluIGhlcmU/IA==</value>
+</property>
+```
+
+#### Creation through the Azure Portal
+
+Creation through the portal is covered in [Quickstart: Create an Azure Data Lake Storage Gen2 storage account](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-quickstart-create-account)
+
+Key Steps
+
+1. Create a new Storage Account in a location which suits you.
+1. "Basics" Tab: select "StorageV2".
+1. "Advanced" Tab: enable "Hierarchical Namespace".
+
+You have now created your storage account. Next, get the key needed for the
+default "Shared Key" authentication.
+
+1. Go to the Azure Portal.
+1. Select "Storage Accounts"
+1. Select the newly created storage account.
+1. In the list of settings, locate "Access Keys" and select that.
+1. Copy one of the access keys to the clipboard, then add it to the XML option,
+set it via your cluster management tools, or store it in a Hadoop JCEKS file or KMS store.
+
+### <a name="new_container"></a> Creating a new container
 
+An Azure storage account can have multiple containers, each with the container
+name as the userinfo field of the URI used to reference it.
 
-## Technical notes
+For example, the container "container1" in the storage account just created
+will have the URL `abfs://container1@abfswales1.dfs.core.windows.net/`.
 
-### Security
 
-### Consistency and Concurrency
+You can create a new container through the ABFS connector by setting the option
+`fs.azure.createRemoteFileSystemDuringInitialization` to `true`.
 
-*TODO*: complete/review
+If the container does not exist, an attempt to list it with `hadoop fs -ls`
+will fail:
 
-The abfs client has a fully consistent view of the store, which has complete Create Read Update and Delete consistency for data and metadata.
-(Compare and contrast with S3 which only offers Create consistency; S3Guard adds CRUD to metadata, but not the underlying data).
+```
+$ hadoop fs -ls abfs://container1@abfswales1.dfs.core.windows.net/
 
-### Performance
+ls: `abfs://container1@abfswales1.dfs.core.windows.net/': No such file or directory
+```
 
-*TODO*: check these.
+Enable remote FS creation and the second attempt succeeds, creating the container as it does so:
 
-* File Rename: `O(1)`.
-* Directory Rename: `O(files)`.
-* Directory Delete: `O(files)`.
+```
+$ hadoop fs -D fs.azure.createRemoteFileSystemDuringInitialization=true \
+ -ls abfs://container1@abfswales1.dfs.core.windows.net/
+```
 
-## Configuring ABFS
+This is useful for creating containers on the command line, especially before
+the `az storage` command supports hierarchical namespaces completely.
 
-Any configuration can be specified generally (or as the default when accessing all accounts) or can be tied to s a specific account.
-For example, an OAuth identity can be configured for use regardless of which account is accessed with the property
-"fs.azure.account.oauth2.client.id"
+
+### Listing and examining containers of a Storage Account
+
+You can use the [Azure Storage Explorer](https://azure.microsoft.com/en-us/features/storage-explorer/).
+
+## <a name="configuring"></a> Configuring ABFS
+
+Any configuration can be specified generally (or as the default when accessing all accounts)
+or can be tied to a specific account.
+For example, an OAuth identity can be configured for use regardless of which
+account is accessed with the property `fs.azure.account.oauth2.client.id`
 or you can configure an identity to be used only for a specific storage account with
-"fs.azure.account.oauth2.client.id.\<account\_name\>.dfs.core.windows.net".
+`fs.azure.account.oauth2.client.id.<account_name>.dfs.core.windows.net`.
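+
+For example, a minimal sketch of the two forms, using the `abfswales1` account
+created earlier (the client IDs here are placeholders):
+
+```xml
+<!-- Default: used for any account without a more specific setting. -->
+<property>
+  <name>fs.azure.account.oauth2.client.id</name>
+  <value>GLOBAL-CLIENT-ID</value>
+</property>
+
+<!-- Overrides the default, but only for the abfswales1 account. -->
+<property>
+  <name>fs.azure.account.oauth2.client.id.abfswales1.dfs.core.windows.net</name>
+  <value>ACCOUNT-SPECIFIC-CLIENT-ID</value>
+</property>
+```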
 
-Note that it doesn't make sense to do this with some properties, like shared keys that are inherently account-specific.
+This is shown in the Authentication section.
 
-## Testing ABFS
+## <a name="authentication"></a> Authentication
 
-See the relevant section in [Testing Azure](testing_azure.html).
+Authentication for ABFS is ultimately granted by [Azure Active Directory](https://docs.microsoft.com/en-us/azure/active-directory/develop/authentication-scenarios).
 
-## References
+The concepts covered there are beyond the scope of this document;
+developers are expected to have read and understood them
+in order to take advantage of the different authentication mechanisms.
 
-* [A closer look at Azure Data Lake Storage Gen2](https://azure.microsoft.com/en-gb/blog/a-closer-look-at-azure-data-lake-storage-gen2/);
-MSDN Article from June 28, 2018.
+What is covered here, briefly, is how to configure the ABFS client to authenticate
+in different deployment situations.
+
+The ABFS client can be deployed in different ways, with its authentication needs
+driven by them.
+
+1. With the storage account's authentication secret in the configuration:
+"Shared Key".
+1. Using OAuth 2.0 tokens of one form or another.
+1. Deployed in-Azure with the Azure VMs providing OAuth 2.0 tokens to the application:
+ "Managed Identity".
+
+What can be changed is what secrets/credentials are used to authenticate the caller.
+
+The authentication mechanism is set in `fs.azure.account.auth.type` (or the
+account-specific variant) and, for the various OAuth options, in
+`fs.azure.account.oauth.provider.type`.
+
+All secrets can be stored in JCEKS files. These are encrypted and password
+protected; use them or a compatible Hadoop Key Management Store wherever
+possible.
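+
+As a sketch, the shared key for the `abfswales1` account could be stored with the
+standard `hadoop credential` command (the JCEKS path and key value here are
+placeholders), then picked up by pointing `hadoop.security.credential.provider.path`
+at that file:
+
+```bash
+hadoop credential create fs.azure.account.key.abfswales1.dfs.core.windows.net \
+  -value "THE-ACCOUNT-ACCESS-KEY" \
+  -provider jceks://file/etc/hadoop/conf/azure.jceks
+```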
+
+### <a name="shared-key-auth"></a> Default: Shared Key
+
+This is the simplest authentication mechanism of account + password.
+
+The account name is inferred from the URL;
+the password, the "key", is retrieved from the XML/JCEKS configuration files.
+
+```xml
+<property>
+  <name>fs.azure.account.auth.type.abfswales1.dfs.core.windows.net</name>
+  <value>SharedKey</value>
+  <description>
+  </description>
+</property>
+<property>
+  <name>fs.azure.account.key.abfswales1.dfs.core.windows.net</name>
+  <value>ZGlkIHlvdSByZWFsbHkgdGhpbmsgSSB3YXMgZ29pbmcgdG8gcHV0IGEga2V5IGluIGhlcmU/IA==</value>
+  <description>
+  The secret password. Never share these.
+  </description>
+</property>
+```
+
+*Note*: The source of the account key can be changed through a custom key provider;
+one exists to execute a shell script to retrieve it.
+
+### <a name="oauth-client-credentials"></a> OAuth 2.0 Client Credentials
+
+OAuth 2.0 credentials of (client id, client secret, endpoint) are provided in the configuration/JCEKS file.
+
+The specifics of this process are covered
+in [hadoop-azure-datalake](../hadoop-azure-datalake/index.html#Configuring_Credentials_and_FileSystem);
+the key names are slightly different here.
+
+```xml
+<property>
+  <name>fs.azure.account.auth.type</name>
+  <value>OAuth</value>
+  <description>
+  Use OAuth authentication
+  </description>
+</property>
+<property>
+  <name>fs.azure.account.oauth.provider.type</name>
+  <value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value>
+  <description>
+  Use client credentials
+  </description>
+</property>
+<property>
+  <name>fs.azure.account.oauth2.client.endpoint</name>
+  <value></value>
+  <description>
+  URL of OAuth endpoint
+  </description>
+</property>
+<property>
+  <name>fs.azure.account.oauth2.client.id</name>
+  <value></value>
+  <description>
+  Client ID
+  </description>
+</property>
+<property>
+  <name>fs.azure.account.oauth2.client.secret</name>
+  <value></value>
+  <description>
+  Secret
+  </description>
+</property>
+```
+
+### <a name="oauth-user-and-passwd"></a> OAuth 2.0: Username and Password
+
+An OAuth 2.0 endpoint, username and password are provided in the configuration/JCEKS file.
+
+```xml
+<property>
+  <name>fs.azure.account.auth.type</name>
+  <value>OAuth</value>
+  <description>
+  Use OAuth authentication
+  </description>
+</property>
+<property>
+  <name>fs.azure.account.oauth.provider.type</name>
+  <value>org.apache.hadoop.fs.azurebfs.oauth2.UserPasswordTokenProvider</value>
+  <description>
+  Use user and password
+  </description>
+</property>
+<property>
+  <name>fs.azure.account.oauth2.client.endpoint</name>
+  <value></value>
+  <description>
+  URL of OAuth 2.0 endpoint
+  </description>
+</property>
+<property>
+  <name>fs.azure.account.oauth2.user.name</name>
+  <value></value>
+  <description>
+  username
+  </description>
+</property>
+<property>
+  <name>fs.azure.account.oauth2.user.password</name>
+  <value></value>
+  <description>
+  password for account
+  </description>
+</property>
+```
+
+### <a name="oauth-refresh-token"></a> OAuth 2.0: Refresh Token
+
+With an existing OAuth 2.0 token, make a request to the Active Directory endpoint
+`https://login.microsoftonline.com/Common/oauth2/token` for this token to be refreshed.
+
+```xml
+<property>
+  <name>fs.azure.account.auth.type</name>
+  <value>OAuth</value>
+  <description>
+  Use OAuth 2.0 authentication
+  </description>
+</property>
+<property>
+  <name>fs.azure.account.oauth.provider.type</name>
+  <value>org.apache.hadoop.fs.azurebfs.oauth2.RefreshTokenBasedTokenProvider</value>
+  <description>
+  Use the Refresh Token Provider
+  </description>
+</property>
+<property>
+  <name>fs.azure.account.oauth2.refresh.token</name>
+  <value></value>
+  <description>
+  Refresh token
+  </description>
+</property>
+<property>
+  <name>fs.azure.account.oauth2.client.id</name>
+  <value></value>
+  <description>
+  Optional Client ID
+  </description>
+</property>
+```
+
+### <a name="managed-identity"></a> Azure Managed Identity
+
+[Azure Managed Identities](https://docs.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/overview), formerly "Managed Service Identities".
+
+OAuth 2.0 tokens are issued by a special endpoint only accessible
+from the executing VM (`http://169.254.169.254/metadata/identity/oauth2/token`).
+The issued credentials can be used to authenticate.
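+
+To check that the endpoint is reachable and issuing tokens from within the VM,
+a sketch of a manual probe (parameters taken from the Azure Instance Metadata
+Service documentation) is:
+
+```bash
+curl -H "Metadata: true" \
+  "http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fstorage.azure.com%2F"
+```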
+
+The Azure Portal/CLI is used to create the service identity.
+
+```xml
+<property>
+  <name>fs.azure.account.auth.type</name>
+  <value>OAuth</value>
+  <description>
+  Use OAuth authentication
+  </description>
+</property>
+<property>
+  <name>fs.azure.account.oauth.provider.type</name>
+  <value>org.apache.hadoop.fs.azurebfs.oauth2.MsiTokenProvider</value>
+  <description>
+  Use MSI for issuing OAuth tokens
+  </description>
+</property>
+<property>
+  <name>fs.azure.account.oauth2.msi.tenant</name>
+  <value></value>
+  <description>
+  Optional MSI Tenant ID
+  </description>
+</property>
+<property>
+  <name>fs.azure.account.oauth2.client.id</name>
+  <value></value>
+  <description>
+  Optional Client ID
+  </description>
+</property>
+```
+
+### Custom OAuth 2.0 Token Provider
+
+A Custom OAuth 2.0 token provider supplies the ABFS connector with an OAuth 2.0
+token when its `getAccessToken()` method is invoked.
+
+```xml
+<property>
+  <name>fs.azure.account.auth.type</name>
+  <value>Custom</value>
+  <description>
+  Custom Authentication
+  </description>
+</property>
+<property>
+  <name>fs.azure.account.oauth.provider.type</name>
+  <value></value>
+  <description>
+  classname of Custom Authentication Provider
+  </description>
+</property>
+```
+
+The declared class must implement `org.apache.hadoop.fs.azurebfs.extensions.CustomTokenProviderAdaptee`
+and optionally `org.apache.hadoop.fs.azurebfs.extensions.BoundDTExtension`.
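+
+As a very rough sketch (the package, class name and token-fetching logic below are
+hypothetical; check the `CustomTokenProviderAdaptee` javadocs in your Hadoop build
+for the exact contract), such a provider looks like:
+
+```java
+package com.example.auth; // hypothetical package
+
+import java.io.IOException;
+import java.util.Date;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.azurebfs.extensions.CustomTokenProviderAdaptee;
+
+public class ExampleTokenProvider implements CustomTokenProviderAdaptee {
+
+  private String accountName;
+
+  @Override
+  public void initialize(Configuration configuration, String accountName)
+      throws IOException {
+    // Read any provider-specific settings from the configuration here.
+    this.accountName = accountName;
+  }
+
+  @Override
+  public String getAccessToken() throws IOException {
+    // Obtain an OAuth 2.0 access token for the account from your own token service.
+    return fetchTokenFromYourService(accountName);
+  }
+
+  @Override
+  public Date getExpiryTime() {
+    // Tell the connector when the last issued token expires, so it knows when to refresh.
+    return new Date(System.currentTimeMillis() + 60L * 60 * 1000);
+  }
+
+  private String fetchTokenFromYourService(String account) throws IOException {
+    // Placeholder: integrate with whatever issues tokens in your deployment.
+    throw new IOException("token service integration not implemented for " + account);
+  }
+}
+```
+
+The name of the implementation class is then what goes into `fs.azure.account.oauth.provider.type`.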
+
+## <a name="technical"></a> Technical notes
+
+### <a name="proxy"></a> Proxy setup
+
+The connector uses the JVM proxy settings to control its proxy setup.
+
+See the [Oracle Java documentation](https://docs.oracle.com/javase/8/docs/technotes/guides/net/proxies.html) for the options to set.
+
+As the connector uses HTTPS by default, the `https.proxyHost` and `https.proxyPort`
+options are those which must be configured.
+
+In MapReduce jobs, including distcp, the proxy options must be set in both the
+`mapreduce.map.java.opts` and `mapreduce.reduce.java.opts`.
+
+```bash
+# This variable is only here to avoid typing the same values twice.
+# Its name is not important.
+export DISTCP_PROXY_OPTS="-Dhttps.proxyHost=web-proxy.example.com -Dhttps.proxyPort=80"
+
+hadoop distcp \
+  -D mapreduce.map.java.opts="$DISTCP_PROXY_OPTS" \
+  -D mapreduce.reduce.java.opts="$DISTCP_PROXY_OPTS" \
+  -update -skipcrccheck -numListstatusThreads 40 \
+  hdfs://namenode:8020/users/alice abfs://backups@account.dfs.core.windows.net/users/alice
+```
+
+Without these settings, even though access to ADLS may work from the command line,
+`distcp` access can fail with network errors.
+
+### <a name="security"></a> Security
+
+As with other object stores, login secrets are valuable pieces of information.
+Organizations should have a process for safely sharing them.
+
+### <a name="limitations"></a> Limitations of the ABFS connector
+
+* File last access time is not tracked.
+* Extended attributes are not supported.
+* File Checksums are not supported.
+* The `Syncable` interface's `hsync()` and `hflush()` operations are supported if
+`fs.azure.enable.flush` is set to true (default=true). With the WASB connector,
+this limited the number of times either call could be made to 50,000
+([HADOOP-15478](https://issues.apache.org/jira/browse/HADOOP-15478)).
+If ABFS has a similar limit, then excessive use of sync/flush may
+cause problems; see the sketch after this list.
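+
+If an application does make very heavy use of `hflush()`/`hsync()`, one option,
+sketched below on the assumption that the `fs.azure.enable.flush` option behaves
+as described above, is to disable flushing entirely, trading durability guarantees
+for fewer store operations:
+
+```xml
+<property>
+  <name>fs.azure.enable.flush</name>
+  <value>false</value>
+  <description>Disable hflush()/hsync() support for this client.</description>
+</property>
+```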
+
+### <a name="consistency"></a> Consistency and Concurrency
+
+As with all Azure storage services, the Azure Datalake Gen 2 store offers
+a fully consistent view of the store, with complete
+Create, Read, Update, and Delete consistency for data and metadata.
+(Compare and contrast with S3 which only offers Create consistency;
+S3Guard adds CRUD to metadata, but not the underlying data).
+
+### <a name="performance"></a> Performance and Scalability
+
+For containers with hierarchical namespaces,
+the scalability numbers are, in Big-O-notation, as follows:
+
+| Operation | Scalability |
+|-----------|-------------|
+| File Rename | `O(1)` |
+| File Delete | `O(1)` |
+| Directory Rename | `O(1)` |
+| Directory Delete | `O(1)` |
+
+For non-namespace stores, the scalability becomes:
+
+| Operation | Scalability |
+|-----------|-------------|
+| File Rename | `O(1)` |
+| File Delete | `O(1)` |
+| Directory Rename | `O(files)` |
+| Directory Delete | `O(files)` |
+
+That is: the more files there are, the slower directory operations get.
+
+
+Further reading: [Azure Storage Scalability Targets](https://docs.microsoft.com/en-us/azure/storage/common/storage-scalability-targets?toc=%2fazure%2fstorage%2fqueues%2ftoc.json)
+
+### <a name="extensibility"></a> Extensibility
+
+The ABFS connector supports a number of limited-private/unstable extension
+points for third parties to integrate their authentication and authorization
+services into the ABFS client.
+
+* `CustomDelegationTokenManager` : adds ability to issue Hadoop Delegation Tokens.
+* `AbfsAuthorizer` permits client-side authorization of file operations.
+* `CustomTokenProviderAdaptee`: allows for custom provision of
+Azure OAuth tokens.
+* `KeyProvider`.
+
+Consult the source in `org.apache.hadoop.fs.azurebfs.extensions`
+and all associated tests to see how to make use of these extension points.
+
+_Warning_: These extension points are unstable.
+
+## <a name="options"></a> Other configuration options
+
+Consult the javadocs for `org.apache.hadoop.fs.azurebfs.constants.ConfigurationKeys`,
+`org.apache.hadoop.fs.azurebfs.constants.FileSystemConfigurations` and
+`org.apache.hadoop.fs.azurebfs.AbfsConfiguration` for the full list
+of configuration options and their default values.
+
+
+## <a name="troubleshooting"></a> Troubleshooting
+
+The problems associated with the connector usually come down to, in order:
+
+1. Classpath.
+1. Network setup (proxy etc.).
+1. Authentication and Authorization.
+1. Anything else.
+
+If you log `org.apache.hadoop.fs.azurebfs.services` at `DEBUG` then you will
+see more details about any request which is failing.
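+
+For example, with the default log4j setup this can be done by adding a line such
+as the following to `log4j.properties` (a sketch; adapt it to whatever logging
+configuration your deployment uses):
+
+```
+log4j.logger.org.apache.hadoop.fs.azurebfs.services=DEBUG
+```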
+
+One useful tool for debugging connectivity is the [cloudstore storediag utility](https://github.com/steveloughran/cloudstore/releases).
+
+This validates the classpath, the settings, then tries to work with the filesystem.
+
+```bash
+bin/hadoop jar cloudstore-0.1-SNAPSHOT.jar storediag abfs://container@account.dfs.core.windows.net/
+```
+
+1. If the `storediag` command cannot work with an abfs store, nothing else is likely to.
+1. If the `storediag` command does work, that does not guarantee that the classpath
+or configuration on the rest of the cluster is also going to work, especially
+in distributed applications. But it is at least a start.
+
+### `ClassNotFoundException: org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem`
+
+The `hadoop-azure` JAR is not on the classpath.
+
+```
+java.lang.RuntimeException: java.lang.ClassNotFoundException:
+    Class org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem not found
+  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2625)
+  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3290)
+  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3322)
+  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:136)
+  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3373)
+  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3341)
+  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:491)
+  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
+Caused by: java.lang.ClassNotFoundException:
+    Class org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem not found
+  at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2529)
+  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2623)
+  ... 16 more
+```
+
+Tip: if this is happening on the command line, you can turn on debug logging
+of the hadoop scripts:
+
+```bash
+export HADOOP_SHELL_SCRIPT_DEBUG=true
+```
+
+If this is happening on an application running within the cluster, it means
+the cluster (somehow) needs to be configured so that the `hadoop-azure`
+module and dependencies are on the classpath of deployed applications.
+
+### `ClassNotFoundException: com.microsoft.azure.storage.StorageErrorCode`
+
+The `azure-storage` JAR is not on the classpath.
+
+### `Server failed to authenticate the request`
+
+The request wasn't authenticated while using the default shared-key
+authentication mechanism.
+
+```
+Operation failed: "Server failed to authenticate the request.
+ Make sure the value of Authorization header is formed correctly including the signature.",
+ 403, HEAD, https://account.dfs.core.windows.net/container2?resource=filesystem&timeout=90
+  at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:135)
+  at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getFilesystemProperties(AbfsClient.java:209)
+  at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getFilesystemProperties(AzureBlobFileSystemStore.java:259)
+  at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.fileSystemExists(AzureBlobFileSystem.java:859)
+  at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:110)
+```
+
+Causes include:
+
+* Your credentials are incorrect.
+* Your shared secret has expired. In Azure, this happens automatically.
+* Your shared secret has been revoked.
+* Host/VM clock drift means that your client's clock is out of sync with the
+Azure servers, so the call is being rejected as either out of date (considered a replay)
+or from the future. Fix: check your clocks, etc.
+
+### `Configuration property _something_.dfs.core.windows.net not found`
+
+There's no `fs.azure.account.key.` entry in your cluster configuration declaring the
+access key for the specific account, or you are using the wrong URL.
+
+```
+$ hadoop fs -ls abfs://container@abfswales2.dfs.core.windows.net/
+
+ls: Configuration property abfswales2.dfs.core.windows.net not found.
+```
+
+* Make sure that the URL is correct.
+* Add the missing account key, as in the sketch below.
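+
+For the account in the listing above, the missing entry would look something like
+this (the key value is a placeholder):
+
+```xml
+<property>
+  <name>fs.azure.account.key.abfswales2.dfs.core.windows.net</name>
+  <value>THE-ACCOUNT-ACCESS-KEY</value>
+</property>
+```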
+
+
+### `No such file or directory when trying to list a container`
+
+There is no container of the given name. Either it has been mistyped
+or the container needs to be created.
+
+```
+$ hadoop fs -ls abfs://container@abfswales1.dfs.core.windows.net/
+
+ls: `abfs://container@abfswales1.dfs.core.windows.net/': No such file or directory
+```
+
+* Make sure that the URL is correct.
+* Create the container if needed.
+
+### "HTTP connection to https://login.microsoftonline.com/_something_ failed for getting token from AzureAD. Http response: 200 OK"
+
+The OAuth authentication endpoint didn't fail with an HTTP error code, but it
+didn't return JSON either: the response had a content type of `text/html`,
+`text/plain` or `application/xml`.
+
+```
+$ bin/hadoop fs -ls abfs://container@abfswales1.dfs.core.windows.net/
+
+ ...
+
+ls: HTTP Error 200;
+  url='https://login.microsoftonline.com/02a07549-0a5f-4c91-9d76-53d172a638a2/oauth2/authorize'
+  AADToken: HTTP connection to
+  https://login.microsoftonline.com/02a07549-0a5f-4c91-9d76-53d172a638a2/oauth2/authorize
+  failed for getting token from AzureAD.
+  Unexpected response.
+  Check configuration, URLs and proxy settings.
+  proxies=none;
+  requestId='dd9d526c-8b3d-4b3f-a193-0cf021938600';
+  contentType='text/html; charset=utf-8';
+```
+
+Likely causes are configuration and networking:
+
+1. Authentication is failing, and the caller is being served the Azure Active Directory
+sign-on page intended for humans, even though it is a machine calling.
+1. The URL is wrong: it is pointing at a web page unrelated to OAuth 2.0.
+1. There's a proxy server in the way trying to return helpful instructions.
+
+## <a name="testing"></a> Testing ABFS
+
+See the relevant section in [Testing Azure](testing_azure.html).

+ 12 - 3
hadoop-tools/hadoop-azure/src/site/markdown/index.md

@@ -16,17 +16,26 @@
 
 <!-- MACRO{toc|fromDepth=1|toDepth=3} -->
 
+See also:
+
+* [ABFS](./abfs.html)
+* [Testing](./testing_azure.html)
+
 ## Introduction
 
-The hadoop-azure module provides support for integration with
+The `hadoop-azure` module provides support for integration with
 [Azure Blob Storage](http://azure.microsoft.com/en-us/documentation/services/storage/).
-The built jar file, named hadoop-azure.jar, also declares transitive dependencies
+The built jar file, named `hadoop-azure.jar`, also declares transitive dependencies
 on the additional artifacts it requires, notably the
 [Azure Storage SDK for Java](https://github.com/Azure/azure-storage-java).
 
 To make it part of Apache Hadoop's default classpath, simply make sure that
-HADOOP_OPTIONAL_TOOLS in hadoop-env.sh has 'hadoop-azure' in the list.
+`HADOOP_OPTIONAL_TOOLS` in `hadoop-env.sh` has `hadoop-azure` in the list.
+Example:
 
+```bash
+export HADOOP_OPTIONAL_TOOLS="hadoop-azure,hadoop-azure-datalake"
+```
 ## Features
 
 * Read and write data stored in an Azure Blob Storage account.