
Hello stackoverflow community,

I am having some issues reading Parquet files. The problems start after I upload the Parquet file to Azure Data Lake Gen2 using Python.

I am using the official Microsoft documentation: https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-directory-file-acl-python

Aside from the authentication, I am using this part:

def upload_file_to_directory():
    try:
        file_system_client = service_client.get_file_system_client(file_system="my-file-system")

        directory_client = file_system_client.get_directory_client("my-directory")

        file_client = directory_client.create_file("uploaded-file.txt")

        local_file = open("C:\\file-to-upload.txt", 'r')

        file_contents = local_file.read()

        file_client.append_data(data=file_contents, offset=0, length=len(file_contents))

        file_client.flush_data(len(file_contents))

    except Exception as e:
        print(e)

When I use the code to upload a small CSV file, it works fine: the CSV file is uploaded, and when I download it I can open it without any problems.

If I convert the same data frame to a small Parquet file and upload it, the upload works fine. But when I download the file and try to open it, I get this error message:

ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

If I read the Parquet file directly, without uploading, it works fine.
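For what it's worth, a valid Parquet file both starts and ends with the magic bytes `PAR1`, so the downloaded copy can be checked without PyArrow. A small diagnostic sketch (the function name and path are illustrative):

```python
def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes b'PAR1'."""
    with open(path, "rb") as f:
        header = f.read(4)
        f.seek(-4, 2)  # move to 4 bytes before the end of the file
        footer = f.read(4)
    return header == b"PAR1" and footer == b"PAR1"
```

If this returns True for the local file but False for the downloaded copy, the bytes were altered somewhere along the upload/download path.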

Does anyone have a suggestion how I need to modify the code so I don't destroy my parquet file?

Thanks!

2 Answers


I'm not sure what's wrong with your code (it looks incomplete), but you can try this code; it works on my side:

import numpy as np
import pandas as pd

try:
    file_system_client = service_client.get_file_system_client(file_system="my-file-system")

    directory_client = file_system_client.get_directory_client("my-directory")

    file_client = directory_client.create_file("data.parquet")

    # to_parquet() with no path returns the Parquet file content as bytes
    df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                       'two': ['foo', 'bar', 'baz'],
                       'three': [True, False, True]},
                      index=list('abc')).to_parquet()

    file_client.append_data(data=df, offset=0, length=len(df))

    file_client.flush_data(len(df))

except Exception as e:
    print(e)
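Note the code above uploads raw bytes rather than the result of a text-mode read. A likely culprit in the question's snippet is opening the file with mode `'r'`: Parquet is a binary format, and text-mode reading decodes (and may newline-translate) the bytes. A stdlib-only sketch of the difference, using made-up bytes that resemble a Parquet file:

```python
import os
import tempfile

# Bytes like those in a real Parquet file: binary, not valid UTF-8
data = b"PAR1\x00\xff\x80PAR1"

path = os.path.join(tempfile.mkdtemp(), "demo.parquet")
with open(path, "wb") as f:
    f.write(data)

# Binary mode ('rb') returns the exact bytes, which are safe to upload
with open(path, "rb") as f:
    round_trip = f.read()
assert round_trip == data

# Text mode ('r'), as in the question, tries to decode the bytes as text
# and typically fails outright, or silently mangles them
try:
    with open(path, "r") as f:
        f.read()
    text_mode_ok = True
except UnicodeDecodeError:
    text_mode_ok = False
```

So if you want to keep the original structure, opening the Parquet file with `'rb'` instead of `'r'` should be enough.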


I just resolved this error in my project today.

I am using pyarrow.parquet.write_table to write my Parquet file.

I was passing a native Python file object to the where parameter, which somehow caused the footer to never get written.

When I switched to using PyArrow output streams instead of native Python file objects, the footer got written correctly on stream close, which resolved this issue for me.

