I am trying to read data from some containers into my notebook and load it into a Spark or pandas DataFrame. There is some documentation about signing in with the account password, but how can I do it with Azure Active Directory?
- Are you talking about accessing Azure Blob Storage or Azure Data Lake Storage? When you say there are some documents, could you please point me to one article which talks about accessing Azure Blob Storage using Active Directory? – CHEEKATLAPRADEEP, Nov 19, 2019 at 10:43
- Blob only. learn.microsoft.com/en-us/azure/machine-learning/… This is one example that logs in with the Azure Blob account and password instead of AAD; by password I meant the account password. Actually, I found out that in the newest edition of azure-storage-blob I could use BlobServiceClient to log in: pypi.org/project/azure-storage-blob. But for now I simply use Key Vault to log in on the Databricks notebook and avoid AAD. – zzzk, Nov 19, 2019 at 17:12
- The document you shared uses the Azure Storage account name and the account key (access key); it doesn't use a password anywhere. Could you please clarify the ask? – CHEEKATLAPRADEEP, Nov 20, 2019 at 6:14
- The account key is the password, like a root key for the account. I am asking specifically how to log in using AAD instead of account name + account key. – zzzk, Nov 20, 2019 at 20:47
2 Answers
Unfortunately, these are the only supported methods in Databricks for accessing Azure Blob Storage (a minimal mount sketch follows the reference below):
- Mount an Azure Blob storage container
- Access Azure Blob storage directly
- Access Azure Blob storage using the RDD API
Reference: Databricks - Azure Blob Storage
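For example, mounting a container from a Databricks notebook typically looks like the snippet below. This is a minimal sketch rather than a quote from the linked page; the container, account, key, and mount-point values are placeholders you need to fill in.
# Minimal sketch: mount an Azure Blob storage container in a Databricks notebook
dbutils.fs.mount(
    source="wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
    mount_point="/mnt/<mount-name>",
    extra_configs={"fs.azure.account.key.<storage-account-name>.blob.core.windows.net": "<storage-account-key>"}
)
# Files in the container are then available under the mount point
df = spark.read.csv("/mnt/<mount-name>/<path-to-file>.csv", header=True)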
Hope this helps.
There are several official Azure documents about accessing Azure Blob Storage using Azure AD, as below (a minimal sketch with the newer SDK follows the list).
- Authorize access to Azure blobs and queues using Azure Active Directory
- Authorize access to blobs and queues with Azure Active Directory from a client application
- Authorize with Azure Active Directory (in Authorize requests to Azure Storage)
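If you want to access blobs with AAD directly, the newer azure-storage-blob package (v12+) accepts an azure-identity credential. Here is a minimal sketch, assuming a service principal that has been granted a data-plane role such as Storage Blob Data Reader on the account; all ids and names are placeholders.
# Minimal sketch: read a blob with an AAD service principal (azure-identity + azure-storage-blob >= 12)
from azure.identity import ClientSecretCredential
from azure.storage.blob import BlobServiceClient

credential = ClientSecretCredential(
    tenant_id='<your tenant id>',
    client_id='<your client id>',
    client_secret='<your client secret>'
)
service = BlobServiceClient(
    account_url="https://<your storage account name>.blob.core.windows.net",
    credential=credential
)
blob_client = service.get_blob_client(container='<your container name>', blob='<your blob name>')
data = blob_client.download_blob().readall()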
Meanwhile, here is my sample code to get the key (account password) of an Azure Storage account via an AAD service principal, so that you can use it in Databricks.
import json
from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.storage import StorageManagementClient

# Please refer to the second document above to get these parameter values
# (register an AAD app / service principal and grant it access to the subscription)
credentials = ServicePrincipalCredentials(
    client_id='<your client id>',
    secret='<your client secret>',
    tenant='<your tenant id>'
)
subscription_id = '<your subscription id>'
client = StorageManagementClient(credentials, subscription_id)

resource_group_name = '<the resource group name of your storage account>'
account_name = '<your storage account name>'
# print(dir(client.storage_accounts))

# List the access keys of the storage account via the management API;
# raw=True exposes the raw JSON response text
keys_json_text = client.storage_accounts.list_keys(resource_group_name, account_name, raw=True).response.text
keys_json = json.loads(keys_json_text)
# print(keys_json)
# {"keys":[{"keyName":"key1","value":"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx==","permissions":"FULL"},{"keyName":"key2","value":"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx==","permissions":"FULL"}]}

# Take the first key (key1) as the account key / password
key1 = keys_json['keys'][0]['value']
print(key1)
Then, you can use the account password (key) above to follow the Azure Databricks official document Data > Data Sources > Azure Blob Storage to read data.
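For instance, reading directly with Spark typically looks like the snippet below. This is a minimal sketch; the account, container, and path values are placeholders, and key1 is the key retrieved above.
# Minimal sketch: set the account key in the Spark session and read via the wasbs:// scheme
spark.conf.set(
    "fs.azure.account.key.<your storage account name>.blob.core.windows.net",
    key1  # the account key retrieved above
)
df = spark.read.format("csv").option("header", "true").load(
    "wasbs://<your container name>@<your storage account name>.blob.core.windows.net/<your blob name of dataset>"
)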
Otherwise, you can refer to Steps 1 & 2 of my answer to the other SO thread transform data in azure data factory using python data bricks to read data, as in the code below.
import pandas as pd
from datetime import datetime, timedelta
# Note: this uses the older azure-storage-blob v2.x API
from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.blob import ContainerPermissions

account_name = '<your account name>'
account_key = '<your account key>'  # the key comes from the code above
container_name = '<your container name>'

# Generate a read-only SAS token for the container, valid for one hour
service = BaseBlobService(account_name=account_name, account_key=account_key)
token = service.generate_container_shared_access_signature(
    container_name,
    permission=ContainerPermissions.READ,
    expiry=datetime.utcnow() + timedelta(hours=1)
)

# Build a blob URL with the SAS token, read it into pandas, then convert to a Spark DataFrame
blob_name = '<your blob name of dataset>'
blob_url_with_token = f"https://{account_name}.blob.core.windows.net/{container_name}/{blob_name}?{token}"
pdf = pd.read_json(blob_url_with_token)
df = spark.createDataFrame(pdf)