
I want to store processed data from a pandas DataFrame in Azure Blob Storage in Parquet format. Currently I have to write the Parquet file to local disk first and then upload it. Instead, I want to write the pyarrow.Table into a pyarrow.NativeFile (an in-memory file object) and upload it directly. Can anyone help me with this? The code below works fine:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

battery_pq = pd.read_csv('test.csv')
######## Some data processing
battery_pq = pa.Table.from_pandas(battery_pq)
# Write the table to local disk, then upload the file to the blob container.
pq.write_table(battery_pq, 'example.parquet')
block_blob_service.create_blob_from_path(container_name, 'example.parquet', 'example.parquet')

I need to create the file in memory (a file-like I/O object) and then upload it to the blob.


2 Answers


You can use io.BytesIO for this, or alternatively Apache Arrow provides its own native implementation, BufferOutputStream. The benefit of the latter is that it writes to the stream without going through Python, so fewer copies are made and the GIL is released.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = ...  # your processed pandas.DataFrame
table = pa.Table.from_pandas(df)

# Write the Parquet file into an Arrow-native in-memory stream.
buf = pa.BufferOutputStream()
pq.write_table(table, buf)

# getvalue() returns a pyarrow.Buffer; to_pybytes() converts it to Python bytes.
block_blob_service.create_blob_from_bytes(
    container,
    "example.parquet",
    buf.getvalue().to_pybytes()
)
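For comparison, the io.BytesIO route mentioned above would look roughly like this (a sketch that reuses the `table`, `block_blob_service`, and `container` names from the snippet above; it keeps everything in memory too, just going through Python's file object layer):

import io
import pyarrow.parquet as pq

# Write the Parquet file into an in-memory Python buffer instead of to disk.
buf = io.BytesIO()
pq.write_table(table, buf)

# getvalue() returns the accumulated bytes; no temp file is ever created.
block_blob_service.create_blob_from_bytes(
    container,
    "example.parquet",
    buf.getvalue()
)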

2 Comments

The function block_blob_service.create_blob_from_stream (azure-storage.readthedocs.io/ref/…) also works for buf, without converting it to bytes first.
I don't think block_blob_service.create_blob_from_bytes exists in the Python Azure API anymore. Is there a way to do this with the classes/functions that currently exist in the API? learn.microsoft.com/en-us/python/api/azure-storage-blob/…
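Following the first comment, the stream variant would look roughly like this (a sketch against the legacy azure-storage SDK; pa.BufferReader wraps the Arrow buffer in a readable, zero-copy file object):

import pyarrow as pa

# Wrap the Arrow buffer in a readable stream instead of copying it to bytes.
stream = pa.BufferReader(buf.getvalue())
block_blob_service.create_blob_from_stream(
    container,
    "example.parquet",
    stream
)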

There's a new Python SDK version; create_blob_from_bytes is now legacy.

import pandas as pd
from azure.storage.blob import BlobServiceClient
from io import BytesIO

blob_service_client = BlobServiceClient.from_connection_string(blob_store_conn_str)
blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_path)

# Write the DataFrame to an in-memory buffer instead of a local file.
parquet_file = BytesIO()
df.to_parquet(parquet_file, engine='pyarrow')
parquet_file.seek(0)  # move the stream position back to the beginning after writing

blob_client.upload_blob(data=parquet_file)
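If the blob may already exist, pass overwrite=True to upload_blob to avoid a ResourceExistsError. To read the blob back into a DataFrame later, a minimal sketch with the same v12 SDK (pd.read_parquet accepts a file-like object) would be:

# Download the blob into memory and parse it back into a DataFrame.
downloaded = BytesIO(blob_client.download_blob().readall())
df_roundtrip = pd.read_parquet(downloaded, engine='pyarrow')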

