13

I need to read .parquet files into a pandas DataFrame in Python on my local machine without downloading the files to disk. The parquet files are stored on Azure Blob Storage in a hierarchical directory structure. I am doing something like the following and am not sure how to proceed:

from azure.storage.blob import BlobServiceClient
blob_service_client = BlobServiceClient.from_connection_string(connection_string)

blob_client = blob_service_client.get_blob_client(container="abc", blob="/xyz/pqr/folder_with_parquet_files")

I have used dummy names here for privacy reasons. Assuming the directory "folder_with_parquet_files" contains n parquet files, how can I read them into a single pandas DataFrame?

1
  • get_blob_client can only read one parquet file at a time from the Azure blob, so I think you need a loop. Commented Aug 12, 2020 at 3:30

4 Answers

13

You could use pandas and read the parquet from a stream. This can be very helpful for small data sets, since no Spark session is required here. It may be the fastest way, especially for testing purposes.

import pandas as pd
from io import BytesIO
from azure.storage.blob import ContainerClient

path = '/path_to_blob/..'
conn_string = <conn_string>
blob_name = f'{path}.parquet'

container = ContainerClient.from_connection_string(conn_str=conn_string, container_name=<name_of_container>)

blob_client = container.get_blob_client(blob=blob_name)
stream_downloader = blob_client.download_blob()
stream = BytesIO()
stream_downloader.readinto(stream)  # download the blob into an in-memory buffer
processed_df = pd.read_parquet(stream, engine='pyarrow')
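
Since the original question is about a folder that holds several parquet files, a minimal sketch of extending this approach is to list every blob under the folder prefix and concatenate the pieces. It assumes the container and folder names from the question and reuses conn_string from above:

import pandas as pd
from io import BytesIO
from azure.storage.blob import ContainerClient

# assumed names from the question: container "abc", folder "xyz/pqr/folder_with_parquet_files"
container = ContainerClient.from_connection_string(conn_str=conn_string, container_name='abc')
prefix = 'xyz/pqr/folder_with_parquet_files/'

dfs = []
for blob in container.list_blobs(name_starts_with=prefix):
    if not blob.name.endswith('.parquet'):
        continue  # skip non-parquet blobs under the prefix
    stream = BytesIO(container.download_blob(blob.name).readall())
    dfs.append(pd.read_parquet(stream, engine='pyarrow'))

combined_df = pd.concat(dfs, ignore_index=True)  # one DataFrame from all n files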

3

Here is a very similar solution, but slightly different in that it uses the newer method azure.storage.blob._download.StorageStreamDownloader.readall:

import pandas as pd
from io import BytesIO
from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient.from_connection_string(connection_string)
container_client = blob_service_client.get_container_client(container="parquet")

downloaded_blob = container_client.download_blob(upload_name)  # upload_name is the blob's name/path
bytes_io = BytesIO(downloaded_blob.readall())
df = pd.read_parquet(bytes_io)

print(df.head())
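
If only a few columns are needed, the read can be restricted with the columns argument of pd.read_parquet; the whole blob is still downloaded, but only the listed columns are parsed into the DataFrame (the column names below are placeholders):

# parse only selected columns from the downloaded bytes (placeholder names)
df = pd.read_parquet(bytes_io, columns=['col_a', 'col_b'])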


1

There is a better way: point the parquet libraries at a remote file system wrapper around Azure Storage. This supports useful features such as batching and streaming, and it generalizes to other cloud providers (AWS, etc.).

import adlfs
import pyarrow.dataset as ds
import os

ds_path = 'omaha-datasets/showdowns_02/OMAHA/players_2/street_2/data/0'  # container/path to the parquet directory
fs = adlfs.AzureBlobFileSystem(account_name=os.environ['AZURE_STORAGE_ACCOUNT'], account_key=os.environ['AZURE_STORAGE_KEY'])
dataset = ds.dataset(ds_path, format="parquet", filesystem=fs)
print(len(dataset.files))  # ~3k files in my case
next(dataset.to_batches(batch_size=5)).to_pandas()  # first 5 rows
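
To materialize the whole dataset (or a filtered subset) as a single DataFrame, a sketch like the following should work with the dataset object from above; the column names used for the projection and filter are placeholders:

# project two columns and push a filter down to the parquet files (placeholder names);
# drop the arguments to read the full dataset
table = dataset.to_table(columns=['col_a', 'col_b'], filter=ds.field('col_a') > 0)
full_df = table.to_pandas()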


0

The get_blob_to_bytes method can be used; note that it belongs to the legacy azure-storage-blob SDK (version 2.x, BlockBlobService).

Here the file is fetched from blob storage and held in memory, and pandas then reads those bytes as parquet.

from azure.storage.blob import BlockBlobService
import pandas as pd
from io import BytesIO

#Source account and key
source_account_name = 'testdata'
source_account_key ='****************'

SOURCE_CONTAINER = 'my-data'
eachFile = 'test/2021/oct/myfile.parq'

source_block_blob_service = BlockBlobService(account_name=source_account_name, account_key=source_account_key)


f = source_block_blob_service.get_blob_to_bytes(SOURCE_CONTAINER, eachFile)
df = pd.read_parquet(BytesIO(f.content))
print(df.shape)

