|
@@ -14,28 +14,15 @@
|
|
|
|
|
|
# Hadoop Azure Data Lake Support
|
|
|
|
|
|
-* [Introduction](#Introduction)
|
|
|
-* [Features](#Features)
|
|
|
-* [Limitations](#Limitations)
|
|
|
-* [Usage](#Usage)
|
|
|
- * [Concepts](#Concepts)
|
|
|
- * [OAuth2 Support](#OAuth2_Support)
|
|
|
- * [Configuring Credentials and FileSystem](#Configuring_Credentials)
|
|
|
- * [Using Refresh Token](#Refresh_Token)
|
|
|
- * [Using Client Keys](#Client_Credential_Token)
|
|
|
- * [Protecting the Credentials with Credential Providers](#Credential_Provider)
|
|
|
- * [Enabling ADL Filesystem](#Enabling_ADL)
|
|
|
- * [Accessing `adl` URLs](#Accessing_adl_URLs)
|
|
|
- * [User/Group Representation](#OIDtoUPNConfiguration)
|
|
|
-* [Testing the `hadoop-azure` Module](#Testing_the_hadoop-azure_Module)
|
|
|
-
|
|
|
-## <a name="Introduction" />Introduction
|
|
|
+<!-- MACRO{toc|fromDepth=1|toDepth=3} -->
|
|
|
+
|
|
|
+## Introduction
|
|
|
|
|
|
The `hadoop-azure-datalake` module provides support for integration with the
|
|
|
[Azure Data Lake Store](https://azure.microsoft.com/en-in/documentation/services/data-lake-store/).
|
|
|
This support comes via the JAR file `azure-datalake-store.jar`.
|
|
|
|
|
|
-## <a name="Features" />Features
|
|
|
+## Features
|
|
|
|
|
|
* Read and write data stored in an Azure Data Lake Storage account.
|
|
|
* Reference file system paths using URLs using the `adl` scheme for Secure Webhdfs i.e. SSL
|
|
@@ -46,7 +33,7 @@ This support comes via the JAR file `azure-datalake-store.jar`.
|
|
|
* API `setOwner()`, `setAcl`, `removeAclEntries()`, `modifyAclEntries()` accepts UPN or OID
|
|
|
(Object ID) as user and group names.
|
|
|
|
|
|
-## <a name="Limitations" />Limitations
|
|
|
+## Limitations
|
|
|
|
|
|
Partial or no support for the following operations :
|
|
|
|
|
@@ -62,9 +49,9 @@ Partial or no support for the following operations :
|
|
|
* User and group information returned as `listStatus()` and `getFileStatus()` is
|
|
|
in the form of the GUID associated in Azure Active Directory.
|
|
|
|
|
|
-## <a name="Usage" />Usage
|
|
|
+## Usage
|
|
|
|
|
|
-### <a name="Concepts" />Concepts
|
|
|
+### Concepts
|
|
|
Azure Data Lake Storage access path syntax is:
|
|
|
|
|
|
```
|
|
@@ -74,7 +61,7 @@ adl://<Account Name>.azuredatalakestore.net/
|
|
|
For details on using the store, see
|
|
|
[**Get started with Azure Data Lake Store using the Azure Portal**](https://azure.microsoft.com/en-in/documentation/articles/data-lake-store-get-started-portal/)
|
|
|
|
|
|
-### <a name="#OAuth2_Support" />OAuth2 Support
|
|
|
+#### OAuth2 Support
|
|
|
|
|
|
Usage of Azure Data Lake Storage requires an OAuth2 bearer token to be present as
|
|
|
part of the HTTPS header as per the OAuth2 specification.
|
|
@@ -86,11 +73,11 @@ and identity management service. See [*What is ActiveDirectory*](https://azure.m
|
|
|
|
|
|
Following sections describes theOAuth2 configuration in `core-site.xml`.
|
|
|
|
|
|
-#### <a name="Configuring_Credentials" />Configuring Credentials & FileSystem
|
|
|
+### Configuring Credentials and FileSystem
|
|
|
Credentials can be configured using either a refresh token (associated with a user),
|
|
|
or a client credential (analogous to a service principal).
|
|
|
|
|
|
-#### <a name="Refresh_Token" />Using Refresh Tokens
|
|
|
+#### Using Refresh Tokens
|
|
|
|
|
|
Add the following properties to the cluster's `core-site.xml`
|
|
|
|
|
@@ -119,9 +106,9 @@ service associated with the client id. See [*Active Directory Library For Java*]
|
|
|
```
|
|
|
|
|
|
|
|
|
-### <a name="Client_Credential_Token" />Using Client Keys
|
|
|
+#### Using Client Keys
|
|
|
|
|
|
-#### Generating the Service Principal
|
|
|
+##### Generating the Service Principal
|
|
|
|
|
|
1. Go to [the portal](https://portal.azure.com)
|
|
|
2. Under "Browse", look for Active Directory and click on it.
|
|
@@ -135,13 +122,13 @@ service associated with the client id. See [*Active Directory Library For Java*]
|
|
|
- The token endpoint (select "View endpoints" at the bottom of the page and copy/paste the OAuth2 .0 Token Endpoint value)
|
|
|
- Resource: Always https://management.core.windows.net/ , for all customers
|
|
|
|
|
|
-#### Adding the service principal to your ADL Account
|
|
|
+##### Adding the service principal to your ADL Account
|
|
|
1. Go to the portal again, and open your ADL account
|
|
|
2. Select Users under Settings
|
|
|
3. Add your user name you created in Step 6 above (note that it does not show up in the list, but will be found if you searched for the name)
|
|
|
4. Add "Owner" role
|
|
|
|
|
|
-### Configure core-site.xml
|
|
|
+##### Configure core-site.xml
|
|
|
Add the following properties to your `core-site.xml`
|
|
|
|
|
|
```xml
|
|
@@ -161,7 +148,7 @@ Add the following properties to your `core-site.xml`
|
|
|
</property>
|
|
|
```
|
|
|
|
|
|
-### <a name="Credential_Provider" />Protecting the Credentials with Credential Providers
|
|
|
+#### Protecting the Credentials with Credential Providers
|
|
|
|
|
|
In many Hadoop clusters, the `core-site.xml` file is world-readable. To protect
|
|
|
these credentials, it is recommended that you use the
|
|
@@ -171,7 +158,7 @@ All ADLS credential properties can be protected by credential providers.
|
|
|
For additional reading on the credential provider API, see
|
|
|
[Credential Provider API](../hadoop-project-dist/hadoop-common/CredentialProviderAPI.html).
|
|
|
|
|
|
-#### Provisioning
|
|
|
+##### Provisioning
|
|
|
|
|
|
```bash
|
|
|
hadoop credential create dfs.adls.oauth2.refresh.token -value 123
|
|
@@ -180,7 +167,7 @@ hadoop credential create dfs.adls.oauth2.credential -value 123
|
|
|
-provider localjceks://file/home/foo/adls.jceks
|
|
|
```
|
|
|
|
|
|
-#### Configuring core-site.xml or command line property
|
|
|
+##### Configuring core-site.xml or command line property
|
|
|
|
|
|
```xml
|
|
|
<property>
|
|
@@ -190,7 +177,7 @@ hadoop credential create dfs.adls.oauth2.credential -value 123
|
|
|
</property>
|
|
|
```
|
|
|
|
|
|
-#### Running DistCp
|
|
|
+##### Running DistCp
|
|
|
|
|
|
```bash
|
|
|
hadoop distcp
|
|
@@ -203,7 +190,7 @@ NOTE: You may optionally add the provider path property to the `distcp` command
|
|
|
line instead of added job specific configuration to a generic `core-site.xml`.
|
|
|
The square brackets above illustrate this capability.`
|
|
|
|
|
|
-### <a name="Accessing_adl_URLs" />Accessing adl URLs
|
|
|
+### Accessing adl URLs
|
|
|
|
|
|
After credentials are configured in `core-site.xml`, any Hadoop component may
|
|
|
reference files in that Azure Data Lake Storage account by using URLs of the following
|
|
@@ -230,7 +217,7 @@ hadoop fs -put testFile adl://yourcontainer.azuredatalakestore.net/testDir/testF
|
|
|
hadoop fs -cat adl://yourcontainer.azuredatalakestore.net/testDir/testFile
|
|
|
test file content
|
|
|
```
|
|
|
-### <a name="OIDtoUPNConfiguration" />User/Group Representation
|
|
|
+### User/Group Representation
|
|
|
|
|
|
The `hadoop-azure-datalake` module provides support for configuring how
|
|
|
User/Group information is represented during
|
|
@@ -254,7 +241,7 @@ Add the following properties to `core-site.xml`
|
|
|
</description>
|
|
|
</property>
|
|
|
```
|
|
|
-## <a name="Testing_the_hadoop-azure_Module" />Testing the azure-datalake-store Module
|
|
|
+## Testing the azure-datalake-store Module
|
|
|
The `hadoop-azure` module includes a full suite of unit tests.
|
|
|
Most of the tests will run without additional configuration by running mvn test.
|
|
|
This includes tests against mocked storage, which is an in-memory emulation of Azure Data Lake Storage.
|