The hadoop-azure module provides support for the Azure Data Lake Storage Gen2 storage layer through the "abfs" connector.
To make it part of Apache Hadoop's default classpath, make sure that the HADOOP_OPTIONAL_TOOLS environment variable has hadoop-azure in the list, on every machine in the cluster:
export HADOOP_OPTIONAL_TOOLS=hadoop-azure
You can set this locally in your .profile/.bashrc, but note it won't propagate to jobs running in-cluster.
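If other optional modules are already listed, overwriting the variable would drop them. The following is a minimal sketch that appends hadoop-azure only when it is missing, assuming the variable holds a comma-separated list; add_tool is a hypothetical helper name, not part of Hadoop.

```shell
# Append a module to a comma-separated list unless it is already present.
# add_tool is a hypothetical helper, not something Hadoop provides.
add_tool() {
  list="$1"
  tool="$2"
  case ",$list," in
    *",$tool,"*) printf '%s\n' "$list" ;;        # already listed: unchanged
    ",,")        printf '%s\n' "$tool" ;;        # empty list: tool only
    *)           printf '%s\n' "$list,$tool" ;;  # otherwise append
  esac
}

# Keep whatever is already configured and add hadoop-azure if needed.
HADOOP_OPTIONAL_TOOLS="$(add_tool "${HADOOP_OPTIONAL_TOOLS:-}" hadoop-azure)"
export HADOOP_OPTIONAL_TOOLS
```

Running the snippet twice leaves the list unchanged the second time, so it is safe to place in a profile script.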
wasb: connector.
FileSystem interface.
For details on ABFS, consult the following documents:
The Azure Storage data model presents 3 core concepts:
The ABFS connector connects to classic containers, or those created with Hierarchical Namespaces.
A key aspect of ADLS Gen 2 is its support for hierarchical namespaces. These are effectively directories and offer high performance rename and delete operations, which significantly improves performance for query engines writing data to the store, including MapReduce, Spark and Hive, as well as DistCp.
This feature is only available if the container was created with "namespace" support.
You enable namespace support when creating a new Storage Account, by checking the "Hierarchical Namespace" option in the Portal UI, or, when creating through the command line, using the option --hierarchical-namespace true.
You cannot enable Hierarchical Namespaces on an existing storage account.
Containers in a storage account with Hierarchical Namespaces are not (currently) readable through the wasb: connector.
Some of the az storage command line commands fail too, for example:
$ az storage container list --account-name abfswales1
Blob API is not yet supported for hierarchical namespace accounts. ErrorCode: BlobApiNotYetSupportedForHierarchicalNamespaceAccounts
The best documentation on getting started with Azure Datalake Gen2 with the abfs connector is "Using Azure Data Lake Storage Gen2 with Azure HDInsight clusters".
It includes instructions to create it from the Azure command line tool, which can be installed on Windows, macOS (via Homebrew) and Linux (apt or yum).
The az storage subcommand handles all storage commands; az storage account create does the creation.
Until the ADLS Gen2 API support is finalized, you need to add an extension to the ADLS command:
az extension add --name storage-preview
Check that all is well by verifying that the usage command includes --hierarchical-namespace:
$ az storage account
usage: az storage account create [-h] [--verbose] [--debug]
[--output {json,jsonc,table,tsv,yaml,none}]
[--query JMESPATH] --resource-group
RESOURCE_GROUP_NAME --name ACCOUNT_NAME
[--sku {Standard_LRS,Standard_GRS,Standard_RAGRS,Standard_ZRS,Premium_LRS,Premium_ZRS}]
[--location LOCATION]
[--kind {Storage,StorageV2,BlobStorage,FileStorage,BlockBlobStorage}]
[--tags [TAGS [TAGS ...]]]
[--custom-domain CUSTOM_DOMAIN]
[--encryption-services {blob,file,table,queue} [{blob,file,table,queue} ...]]
[--access-tier {Hot,Cool}]
[--https-only [{true,false}]]
[--file-aad [{true,false}]]
[--hierarchical-namespace [{true,false}]]
[--bypass {None,Logging,Metrics,AzureServices} [{None,Logging,Metrics,AzureServices} ...]]
[--default-action {Allow,Deny}]
[--assign-identity]
[--subscription _SUBSCRIPTION]
You can list locations with az account list-locations, which lists the name to refer to in the --location argument:
$ az account list-locations -o table
DisplayName Latitude Longitude Name
------------------- ---------- ----------- ------------------
East Asia 22.267 114.188 eastasia
Southeast Asia 1.283 103.833 southeastasia
Central US 41.5908 -93.6208 centralus
East US 37.3719 -79.8164 eastus
East US 2 36.6681 -78.3889 eastus2
West US 37.783 -122.417 westus
North Central US 41.8819 -87.6278 northcentralus
South Central US 29.4167 -98.5 southcentralus
North Europe 53.3478 -6.2597 northeurope
West Europe 52.3667 4.9 westeurope
Japan West 34.6939 135.5022 japanwest
Japan East 35.68 139.77 japaneast
Brazil South -23.55 -46.633 brazilsouth
Australia East -33.86 151.2094 australiaeast
Australia Southeast -37.8136 144.9631 australiasoutheast
South India 12.9822 80.1636 southindia
Central India 18.5822 73.9197 centralindia
West India 19.088 72.868 westindia
Canada Central 43.653 -79.383 canadacentral
Canada East 46.817 -71.217 canadaeast
UK South 50.941 -0.799 uksouth
UK West 53.427 -3.084 ukwest
West Central US 40.890 -110.234 westcentralus
West US 2 47.233 -119.852 westus2
Korea Central 37.5665 126.9780 koreacentral
Korea South 35.1796 129.0756 koreasouth
France Central 46.3772 2.3730 francecentral
France South 43.8345 2.1972 francesouth
Australia Central -35.3075 149.1244 australiacentral
Australia Central 2 -35.3075 149.1244 australiacentral2
Once a location has been chosen, create the account:
az storage account create --verbose \
--name abfswales1 \
--resource-group devteam2 \
--kind StorageV2 \
--hierarchical-namespace true \
--location ukwest \
--sku Standard_LRS \
--https-only true \
--encryption-services blob \
--access-tier Hot \
--tags owner=engineering \
--assign-identity \
--output jsonc
The output of the command is JSON, whose primaryEndpoints field includes the name of the store endpoint:
{
"primaryEndpoints": {
"blob": "https://abfswales1.blob.core.windows.net/",
"dfs": "https://abfswales1.dfs.core.windows.net/",
"file": "https://abfswales1.file.core.windows.net/",
"queue": "https://abfswales1.queue.core.windows.net/",
"table": "https://abfswales1.table.core.windows.net/",
"web": "https://abfswales1.z35.web.core.windows.net/"
}
}
The abfswales1.dfs.core.windows.net account is the name by which the storage account will be referred to.
Now ask for the connection string to the store, which contains the account key:
az storage account show-connection-string --name abfswales1
{
"connectionString": "DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net;AccountName=abfswales1;AccountKey=ZGlkIHlvdSByZWFsbHkgdGhpbmsgSSB3YXMgZ29pbmcgdG8gcHV0IGEga2V5IGluIGhlcmU/IA=="
}
You then need to add the access key to your core-site.xml or JCEKS file, or use your cluster management tool to set the option fs.azure.account.key.STORAGE-ACCOUNT to this value.
<property>
<name>fs.azure.account.key.abfswales1.dfs.core.windows.net</name>
<value>ZGlkIHlvdSByZWFsbHkgdGhpbmsgSSB3YXMgZ29pbmcgdG8gcHV0IGEga2V5IGluIGhlcmU/IA==</value>
</property>
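The account key is the AccountKey field of the connection string shown earlier. The following is a minimal sketch of pulling it out with plain shell parameter expansion; extract_account_key is a hypothetical helper name and the sample string uses a placeholder key, not a real one.

```shell
# Extract the AccountKey field from an Azure connection string.
# extract_account_key is a hypothetical helper, not an az or Hadoop command.
extract_account_key() {
  conn="$1"
  case "$conn" in
    *AccountKey=*) ;;          # field present: carry on
    *) return 1 ;;             # no key in this string
  esac
  key="${conn#*AccountKey=}"   # drop everything up to and including "AccountKey="
  key="${key%%;*}"             # drop any fields that follow the key
  printf '%s\n' "$key"
}

# Placeholder connection string for illustration only.
SAMPLE="DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net;AccountName=abfswales1;AccountKey=EXAMPLEKEY=="
extract_account_key "$SAMPLE"
```

The trailing "=" padding of the base64 key is preserved because only the text before the first "AccountKey=" and after the next ";" is removed.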
Creation through the portal is covered in "Quickstart: Create an Azure Data Lake Storage Gen2 storage account".
Key Steps
You have now created your storage account. Next, get the key needed for the default "Shared Key" authentication.
An Azure storage account can have multiple containers, each with the container name as the userinfo field of the URI used to reference it.
For example, the container "container1" in the storage account just created will have the URL abfs://container1@abfswales1.dfs.core.windows.net/.
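The URL layout described above can be sketched as a small shell function. This is an illustration only: abfs_uri is a hypothetical helper name, and it assumes the default dfs.core.windows.net endpoint suffix used throughout this document.

```shell
# Build an abfs:// URL from a container name, an account name and an
# optional path. abfs_uri is a hypothetical helper for illustration.
abfs_uri() {
  container="$1"
  account="$2"
  path="${3:-/}"   # default to the container root
  printf 'abfs://%s@%s.dfs.core.windows.net%s\n' "$container" "$account" "$path"
}

abfs_uri container1 abfswales1
abfs_uri container1 abfswales1 /users/alice
```

The container goes in the userinfo position (before the "@") and the account forms the first label of the host name.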
You can create a new container through the ABFS connector, by setting the option fs.azure.createRemoteFileSystemDuringInitialization to true.
If the container does not exist, an attempt to list it with hadoop fs -ls will fail:
$ hadoop fs -ls abfs://container1@abfswales1.dfs.core.windows.net/
ls: `abfs://container1@abfswales1.dfs.core.windows.net/': No such file or directory
Enable remote FS creation and the second attempt succeeds, creating the container as it does so:
$ hadoop fs -D fs.azure.createRemoteFileSystemDuringInitialization=true \
-ls abfs://container1@abfswales1.dfs.core.windows.net/
This is useful for creating containers on the command line, especially before the az storage command supports hierarchical namespaces completely.
You can use the Azure Storage Explorer.
Any configuration can be specified generally (or as the default when accessing all accounts) or can be tied to a specific account.
For example, an OAuth identity can be configured for use regardless of which account is accessed with the property fs.azure.account.oauth2.client.id, or you can configure an identity to be used only for a specific storage account with fs.azure.account.oauth2.client.id.<account_name>.dfs.core.windows.net.
This is shown in the Authentication section.
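As a rough illustration of the naming rule, the account-specific form of an option is the general option name followed by the account's host name. The sketch below assumes the dfs.core.windows.net suffix; account_option is a hypothetical helper name.

```shell
# Derive the account-specific name of a configuration option.
# account_option is a hypothetical helper; the suffix follows the
# <account_name>.dfs.core.windows.net pattern described above.
account_option() {
  option="$1"
  account="$2"
  printf '%s.%s.dfs.core.windows.net\n' "$option" "$account"
}

GLOBAL_ID="fs.azure.account.oauth2.client.id"
SCOPED_ID="$(account_option "$GLOBAL_ID" abfswales1)"
echo "$GLOBAL_ID"   # general form: applies to every account
echo "$SCOPED_ID"   # account-specific form: applies to abfswales1 only
```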
Authentication for ABFS is ultimately granted by Azure Active Directory.
The concepts involved are beyond the scope of this document; developers are expected to have read and understood them in order to take advantage of the different authentication mechanisms.
What is covered here, briefly, is how to configure the ABFS client to authenticate in different deployment situations.
The ABFS client can be deployed in different ways, with its authentication needs driven by them.
What can be changed is what secrets/credentials are used to authenticate the caller.
The authentication mechanism is set in fs.azure.account.auth.type (or the account specific variant), and, for the various OAuth options, fs.azure.account.oauth.provider.type.
All secrets can be stored in JCEKS files. These are encrypted and password protected; use them or a compatible Hadoop Key Management Store wherever possible.
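A minimal sketch of storing an account key in a JCEKS file follows. It assumes the standard hadoop credential command is available on the path; jceks_file_uri and store_account_key are hypothetical helper names, and the keystore path in the usage comment is only an example.

```shell
# Build a jceks:// provider URI for a keystore file on the local filesystem.
# jceks_file_uri is a hypothetical helper; it expects an absolute path.
jceks_file_uri() {
  case "$1" in
    /*) printf 'jceks://file%s\n' "$1" ;;
    *)  echo "absolute path required: $1" >&2; return 1 ;;
  esac
}

# Store a secret under the given alias (the configuration option name).
# No -value argument is passed, so the tool prompts for the secret
# rather than taking it from the command line.
store_account_key() {
  alias_name="$1"
  keystore="$2"
  provider="$(jceks_file_uri "$keystore")" || return 1
  hadoop credential create "$alias_name" -provider "$provider"
}

# Example (not run here):
# store_account_key fs.azure.account.key.abfswales1.dfs.core.windows.net /tmp/abfs.jceks
```

The resulting keystore is then referenced from the configuration through the hadoop.security.credential.provider.path option.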
This is the simplest authentication mechanism of account + password.
The account name is inferred from the URL; the password, "key", is retrieved from the XML/JCEKS configuration files.
<property>
<name>fs.azure.account.auth.type.abfswales1.dfs.core.windows.net</name>
<value>SharedKey</value>
<description>
</description>
</property>
<property>
<name>fs.azure.account.key.abfswales1.dfs.core.windows.net</name>
<value>ZGlkIHlvdSByZWFsbHkgdGhpbmsgSSB3YXMgZ29pbmcgdG8gcHV0IGEga2V5IGluIGhlcmU/IA==</value>
<description>
The secret password. Never share these.
</description>
</property>
Note: The source of the account key can be changed through a custom key provider; one exists to execute a shell script to retrieve it.
OAuth 2.0 credentials of (client id, client secret, endpoint) are provided in the configuration/JCEKS file.
The specifics of this process are covered in hadoop-azure-datalake; the key names are slightly different here.
<property>
<name>fs.azure.account.auth.type</name>
<value>OAuth</value>
<description>
Use OAuth authentication
</description>
</property>
<property>
<name>fs.azure.account.oauth.provider.type</name>
<value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value>
<description>
Use client credentials
</description>
</property>
<property>
<name>fs.azure.account.oauth2.client.endpoint</name>
<value></value>
<description>
URL of OAuth endpoint
</description>
</property>
<property>
<name>fs.azure.account.oauth2.client.id</name>
<value></value>
<description>
Client ID
</description>
</property>
<property>
<name>fs.azure.account.oauth2.client.secret</name>
<value></value>
<description>
Secret
</description>
</property>
An OAuth 2.0 endpoint, username and password are provided in the configuration/JCEKS file.
<property>
<name>fs.azure.account.auth.type</name>
<value>OAuth</value>
<description>
Use OAuth authentication
</description>
</property>
<property>
<name>fs.azure.account.oauth.provider.type</name>
<value>org.apache.hadoop.fs.azurebfs.oauth2.UserPasswordTokenProvider</value>
<description>
Use user and password
</description>
</property>
<property>
<name>fs.azure.account.oauth2.client.endpoint</name>
<value></value>
<description>
URL of OAuth 2.0 endpoint
</description>
</property>
<property>
<name>fs.azure.account.oauth2.user.name</name>
<value></value>
<description>
username
</description>
</property>
<property>
<name>fs.azure.account.oauth2.user.password</name>
<value></value>
<description>
password for account
</description>
</property>
With an existing OAuth 2.0 token, make a request of the Active Directory endpoint https://login.microsoftonline.com/Common/oauth2/token for this token to be refreshed.
<property>
<name>fs.azure.account.auth.type</name>
<value>OAuth</value>
<description>
Use OAuth 2.0 authentication
</description>
</property>
<property>
<name>fs.azure.account.oauth.provider.type</name>
<value>org.apache.hadoop.fs.azurebfs.oauth2.RefreshTokenBasedTokenProvider</value>
<description>
Use the Refresh Token Provider
</description>
</property>
<property>
<name>fs.azure.account.oauth2.refresh.token</name>
<value></value>
<description>
Refresh token
</description>
</property>
<property>
<name>fs.azure.account.oauth2.client.id</name>
<value></value>
<description>
Optional Client ID
</description>
</property>
Azure Managed Identities, formerly "Managed Service Identities".
OAuth 2.0 tokens are issued by a special endpoint only accessible from the executing VM (http://169.254.169.254/metadata/identity/oauth2/token).
The issued credentials can be used to authenticate.
The Azure Portal/CLI is used to create the service identity.
<property>
<name>fs.azure.account.auth.type</name>
<value>OAuth</value>
<description>
Use OAuth authentication
</description>
</property>
<property>
<name>fs.azure.account.oauth.provider.type</name>
<value>org.apache.hadoop.fs.azurebfs.oauth2.MsiTokenProvider</value>
<description>
Use MSI for issuing OAuth tokens
</description>
</property>
<property>
<name>fs.azure.account.oauth2.msi.tenant</name>
<value></value>
<description>
Optional MSI Tenant ID
</description>
</property>
<property>
<name>fs.azure.account.oauth2.client.id</name>
<value></value>
<description>
Optional Client ID
</description>
</property>
A Custom OAuth 2.0 token provider supplies the ABFS connector with an OAuth 2.0 token when its getAccessToken() method is invoked.
<property>
<name>fs.azure.account.auth.type</name>
<value>Custom</value>
<description>
Custom Authentication
</description>
</property>
<property>
<name>fs.azure.account.oauth.provider.type</name>
<value></value>
<description>
classname of Custom Authentication Provider
</description>
</property>
The declared class must implement org.apache.hadoop.fs.azurebfs.extensions.CustomTokenProviderAdaptee and optionally org.apache.hadoop.fs.azurebfs.extensions.BoundDTExtension.
The connector uses the JVM proxy settings to control its proxy setup.
See the Oracle Java documentation for the options to set.
As the connector uses HTTPS by default, the https.proxyHost and https.proxyPort options are those which must be configured.
In MapReduce jobs, including distcp, the proxy options must be set in both the mapreduce.map.java.opts and mapreduce.reduce.java.opts.
# This variable is only here to avoid typing the same values twice.
# Its name is not important.
export DISTCP_PROXY_OPTS="-Dhttps.proxyHost=web-proxy.example.com -Dhttps.proxyPort=80"
hadoop distcp \
-D mapreduce.map.java.opts="$DISTCP_PROXY_OPTS" \
-D mapreduce.reduce.java.opts="$DISTCP_PROXY_OPTS" \
-update -skipcrccheck -numListstatusThreads 40 \
hdfs://namenode:8020/users/alice abfs://backups@account.dfs.core.windows.net/users/alice
Without these settings, even though access to ADLS may work from the command line, distcp access can fail with network errors.
As with other object stores, login secrets are valuable pieces of information. Organizations should have a process for safely sharing them.
The Syncable interface's hsync() and hflush() operations are supported if fs.azure.enable.flush is set to true (default=true). With the Wasb connector, this limited the number of times either call could be made to 50,000 (HADOOP-15478). If abfs has a similar limit, then excessive use of sync/flush may cause problems.
As with all Azure storage services, the Azure Datalake Gen 2 store offers a fully consistent view of the store, with complete Create, Read, Update, and Delete consistency for data and metadata. (Compare and contrast with S3 which only offers Create consistency; S3Guard adds CRUD to metadata, but not the underlying data).
For containers with hierarchical namespaces, the scalability numbers are, in Big-O-notation, as follows:
Operation | Scalability |
---|---|
File Rename | O(1) |
File Delete | O(1) |
Directory Rename | O(1) |
Directory Delete | O(1) |
For non-namespace stores, the scalability becomes:
Operation | Scalability |
---|---|
File Rename | O(1) |
File Delete | O(1) |
Directory Rename | O(files) |
Directory Delete | O(files) |
That is: the more files there are, the slower directory operations get.
Further reading: Azure Storage Scalability Targets
The ABFS connector supports a number of limited-private/unstable extension points for third-parties to integrate their authentication and authorization services into the ABFS client.
CustomDelegationTokenManager: adds ability to issue Hadoop Delegation Tokens.
AbfsAuthorizer: permits client-side authorization of file operations.
CustomTokenProviderAdaptee: allows for custom provision of Azure OAuth tokens.
KeyProvider.
Consult the source in org.apache.hadoop.fs.azurebfs.extensions and all associated tests to see how to make use of these extension points.
Warning: These extension points are unstable.
Consult the javadocs for org.apache.hadoop.fs.azurebfs.constants.ConfigurationKeys, org.apache.hadoop.fs.azurebfs.constants.FileSystemConfigurations and org.apache.hadoop.fs.azurebfs.AbfsConfiguration for the full list of configuration options and their default values.
The problems associated with the connector usually come down to, in order
If you log org.apache.hadoop.fs.azurebfs.services at DEBUG then you will see more details about any request which is failing.
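One way to turn this on is sketched below, assuming the deployment is configured through a log4j 1.x style log4j.properties file; that assumption and the helper name enable_debug_logger are not taken from this document. The function adds the logger line only once.

```shell
# Add a DEBUG entry for a logger to a log4j.properties file, unless the
# exact line is already there. enable_debug_logger is a hypothetical helper.
enable_debug_logger() {
  logger="$1"
  file="$2"
  line="log4j.logger.$logger=DEBUG"
  # grep -F: fixed string, -x: match the whole line, -q: no output
  if ! grep -Fxq "$line" "$file" 2>/dev/null; then
    printf '%s\n' "$line" >> "$file"
  fi
}

# Example (not run here; the path depends on the installation):
# enable_debug_logger org.apache.hadoop.fs.azurebfs.services "$HADOOP_CONF_DIR/log4j.properties"
```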
One useful tool for debugging connectivity is the cloudstore storediag utility.
This validates the classpath, the settings, then tries to work with the filesystem.
bin/hadoop jar cloudstore-0.1-SNAPSHOT.jar storediag abfs://container@account.dfs.core.windows.net/
If the storediag command cannot work with an abfs store, nothing else is likely to.
If the storediag command does work successfully, that does not guarantee that the classpath or configuration on the rest of the cluster is also going to work, especially in distributed applications. But it is at least a start.
ClassNotFoundException: org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem
The hadoop-azure JAR is not on the classpath.
java.lang.RuntimeException: java.lang.ClassNotFoundException:
Class org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2625)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3290)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3322)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:136)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3373)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3341)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:491)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
Caused by: java.lang.ClassNotFoundException:
Class org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2529)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2623)
... 16 more
Tip: if this is happening on the command line, you can turn on debug logging of the hadoop scripts:
export HADOOP_SHELL_SCRIPT_DEBUG=true
If this is happening in an application running within the cluster, it means the cluster (somehow) needs to be configured so that the hadoop-azure module and dependencies are on the classpath of deployed applications.
ClassNotFoundException: com.microsoft.azure.storage.StorageErrorCode
The azure-storage JAR is not on the classpath.
Server failed to authenticate the request
The request wasn't authenticated while using the default shared-key authentication mechanism.
Operation failed: "Server failed to authenticate the request.
Make sure the value of Authorization header is formed correctly including the signature.",
403, HEAD, https://account.dfs.core.windows.net/container2?resource=filesystem&timeout=90
at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:135)
at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getFilesystemProperties(AbfsClient.java:209)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getFilesystemProperties(AzureBlobFileSystemStore.java:259)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.fileSystemExists(AzureBlobFileSystem.java:859)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:110)
Causes include:
Configuration property _something_.dfs.core.windows.net not found
There's no fs.azure.account.key. entry in your cluster configuration declaring the access key for the specific account, or you are using the wrong URL.
$ hadoop fs -ls abfs://container@abfswales2.dfs.core.windows.net/
ls: Configuration property abfswales2.dfs.core.windows.net not found.
No such file or directory when trying to list a container
There is no container of the given name. Either it has been mistyped or the container needs to be created.
$ hadoop fs -ls abfs://container@abfswales1.dfs.core.windows.net/
ls: `abfs://container@abfswales1.dfs.core.windows.net/': No such file or directory
text/html, text/plain, application/xml
The OAuth authentication page didn't fail with an HTTP error code, but it didn't return JSON either.
$ bin/hadoop fs -ls abfs://container@abfswales1.dfs.core.windows.net/
...
ls: HTTP Error 200;
url='https://login.microsoftonline.com/02a07549-0a5f-4c91-9d76-53d172a638a2/oauth2/authorize'
AADToken: HTTP connection to
https://login.microsoftonline.com/02a07549-0a5f-4c91-9d76-53d172a638a2/oauth2/authorize
failed for getting token from AzureAD.
Unexpected response.
Check configuration, URLs and proxy settings.
proxies=none;
requestId='dd9d526c-8b3d-4b3f-a193-0cf021938600';
contentType='text/html; charset=utf-8';
Likely causes are configuration and networking:
See the relevant section in Testing Azure.