13

I need to read .parquet files into a pandas DataFrame in Python on my local machine without downloading the files to disk. The parquet files are stored on Azure Blob Storage in a hierarchical directory structure. I am doing something like the following and am not sure how to proceed:

from azure.storage.blob import BlobServiceClient
blob_service_client = BlobServiceClient.from_connection_string(connection_string)

blob_client = blob_service_client.get_blob_client(container="abc", blob="/xyz/pqr/folder_with_parquet_files")

I have used dummy names here for privacy reasons. Assuming the directory "folder_with_parquet_files" contains n parquet files, how can I read them into a single pandas DataFrame?

1
  • get_blob_client can only read one parquet file at a time from the Azure blob, so I think you need a loop. Commented Aug 12, 2020 at 3:30

4 Answers

13

You could use pandas and read the parquet from a stream. This can be very helpful for small data sets, since no Spark session is required here. It may be the fastest way, especially for testing purposes.

import pandas as pd
from io import BytesIO
from azure.storage.blob import ContainerClient

path = '/path_to_blob/..'
conn_string = <conn_string>
blob_name = f'{path}.parquet'

container = ContainerClient.from_connection_string(conn_str=conn_string, container_name=<name_of_container>)

blob_client = container.get_blob_client(blob=blob_name)
stream_downloader = blob_client.download_blob()
stream = BytesIO()
stream_downloader.readinto(stream)  # download the blob into an in-memory buffer
processed_df = pd.read_parquet(stream, engine='pyarrow')
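
Since the original question is about a folder that holds several parquet files, a minimal sketch of extending this approach is to list every blob under the folder prefix and concatenate the pieces. It assumes the container and folder names from the question and reuses conn_string from above:

import pandas as pd
from io import BytesIO
from azure.storage.blob import ContainerClient

# assumed names from the question: container "abc", folder "xyz/pqr/folder_with_parquet_files"
container = ContainerClient.from_connection_string(conn_str=conn_string, container_name='abc')
prefix = 'xyz/pqr/folder_with_parquet_files/'

dfs = []
for blob in container.list_blobs(name_starts_with=prefix):
    if not blob.name.endswith('.parquet'):
        continue  # skip non-parquet blobs under the prefix
    stream = BytesIO(container.download_blob(blob.name).readall())
    dfs.append(pd.read_parquet(stream, engine='pyarrow'))

combined_df = pd.concat(dfs, ignore_index=True)  # one DataFrame from all n files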

3

Here is a very similar solution, but slightly different in that it uses the newer method azure.storage.blob._download.StorageStreamDownloader.readall:

import pandas as pd
from io import BytesIO
from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient.from_connection_string(connection_string)
container_client = blob_service_client.get_container_client(container="parquet")

downloaded_blob = container_client.download_blob(upload_name)  # upload_name is the blob's name/path
bytes_io = BytesIO(downloaded_blob.readall())
df = pd.read_parquet(bytes_io)

print(df.head())
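
If only a few columns are needed, the read can be restricted with the columns argument of pd.read_parquet; the whole blob is still downloaded, but only the listed columns are parsed into the DataFrame (the column names below are placeholders):

# parse only selected columns from the downloaded bytes (placeholder names)
df = pd.read_parquet(bytes_io, columns=['col_a', 'col_b'])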


1

There is a better way: point the parquet libraries at a remote file system wrapper around Azure Storage. This supports useful features such as batching and streaming, and it generalizes to other cloud providers (AWS, etc.).

import adlfs
import pyarrow.dataset as ds
import os

ds_path = 'omaha-datasets/showdowns_02/OMAHA/players_2/street_2/data/0'  # container/path to the parquet directory
fs = adlfs.AzureBlobFileSystem(account_name=os.environ['AZURE_STORAGE_ACCOUNT'], account_key=os.environ['AZURE_STORAGE_KEY'])
dataset = ds.dataset(ds_path, format="parquet", filesystem=fs)
print(len(dataset.files))  # ~3k files in my case
next(dataset.to_batches(batch_size=5)).to_pandas()  # first 5 rows
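
To materialize the whole dataset (or a filtered subset) as a single DataFrame, a sketch like the following should work with the dataset object from above; the column names used for the projection and filter are placeholders:

# project two columns and push a filter down to the parquet files (placeholder names);
# drop the arguments to read the full dataset
table = dataset.to_table(columns=['col_a', 'col_b'], filter=ds.field('col_a') > 0)
full_df = table.to_pandas()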


0

The get_blob_to_bytes method can be used; note that it belongs to the legacy azure-storage-blob SDK (version 2.x, BlockBlobService).

Here the file is fetched from blob storage and held in memory, and pandas then reads those bytes as parquet.

from azure.storage.blob import BlockBlobService
import pandas as pd
from io import BytesIO

#Source account and key
source_account_name = 'testdata'
source_account_key ='****************'

SOURCE_CONTAINER = 'my-data'
eachFile = 'test/2021/oct/myfile.parq'

source_block_blob_service = BlockBlobService(account_name=source_account_name, account_key=source_account_key)


f = source_block_blob_service.get_blob_to_bytes(SOURCE_CONTAINER, eachFile)
df = pd.read_parquet(BytesIO(f.content))
print(df.shape)

