
I want to store processed data from a pandas DataFrame in Azure Blob Storage in Parquet format. Currently I have to write the Parquet file to local disk first and then upload it. Instead, I want to write the pyarrow.Table into a pyarrow.NativeFile (an in-memory file object) and upload it directly. Can anyone help me with this? The code below works fine:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

battery_pq = pd.read_csv('test.csv')
######## Some data processing
battery_pq = pa.Table.from_pandas(battery_pq)
# Write the table to local disk, then upload the file to the blob container.
pq.write_table(battery_pq, 'example.parquet')
block_blob_service.create_blob_from_path(container_name, 'example.parquet', 'example.parquet')

I need to create the file in memory (a file-like I/O object) and then upload it to the blob.


2 Answers


You can use io.BytesIO for this, or alternatively Apache Arrow provides its own native implementation, BufferOutputStream. The benefit of the latter is that it writes to the stream without going through Python, so fewer copies are made and the GIL is released.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = ...  # your processed pandas.DataFrame
table = pa.Table.from_pandas(df)

# Write the Parquet file into an Arrow-native in-memory stream.
buf = pa.BufferOutputStream()
pq.write_table(table, buf)

# getvalue() returns a pyarrow.Buffer; to_pybytes() converts it to Python bytes.
block_blob_service.create_blob_from_bytes(
    container,
    "example.parquet",
    buf.getvalue().to_pybytes()
)
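For comparison, the io.BytesIO route mentioned above would look roughly like this (a sketch that reuses the `table`, `block_blob_service`, and `container` names from the snippet above; it keeps everything in memory too, just going through Python's file object layer):

import io
import pyarrow.parquet as pq

# Write the Parquet file into an in-memory Python buffer instead of to disk.
buf = io.BytesIO()
pq.write_table(table, buf)

# getvalue() returns the accumulated bytes; no temp file is ever created.
block_blob_service.create_blob_from_bytes(
    container,
    "example.parquet",
    buf.getvalue()
)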

2 Comments

The function block_blob_service.create_blob_from_stream (azure-storage.readthedocs.io/ref/…) also works for buf, without converting it to bytes first.
I don't think block_blob_service.create_blob_from_bytes exists in the Python Azure API anymore. Is there a way to do this with the classes/functions that currently exist in the API? learn.microsoft.com/en-us/python/api/azure-storage-blob/…
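Following the first comment, the stream variant would look roughly like this (a sketch against the legacy azure-storage SDK; pa.BufferReader wraps the Arrow buffer in a readable, zero-copy file object):

import pyarrow as pa

# Wrap the Arrow buffer in a readable stream instead of copying it to bytes.
stream = pa.BufferReader(buf.getvalue())
block_blob_service.create_blob_from_stream(
    container,
    "example.parquet",
    stream
)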

There's a new Python SDK version; create_blob_from_bytes is now legacy.

import pandas as pd
from azure.storage.blob import BlobServiceClient
from io import BytesIO

blob_service_client = BlobServiceClient.from_connection_string(blob_store_conn_str)
blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_path)

# Write the DataFrame to an in-memory buffer instead of a local file.
parquet_file = BytesIO()
df.to_parquet(parquet_file, engine='pyarrow')
parquet_file.seek(0)  # move the stream position back to the beginning after writing

blob_client.upload_blob(data=parquet_file)
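If the blob may already exist, pass overwrite=True to upload_blob to avoid a ResourceExistsError. To read the blob back into a DataFrame later, a minimal sketch with the same v12 SDK (pd.read_parquet accepts a file-like object) would be:

# Download the blob into memory and parse it back into a DataFrame.
downloaded = BytesIO(blob_client.download_blob().readall())
df_roundtrip = pd.read_parquet(downloaded, engine='pyarrow')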

