
Hello stackoverflow community,

I am having some issues reading Parquet files. The problems start after I upload the Parquet file to Azure Data Lake Gen2 using Python.

I am using the official Microsoft documentation: https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-directory-file-acl-python

Aside from the authentication, I am using this part:

def upload_file_to_directory():
    try:
        file_system_client = service_client.get_file_system_client(file_system="my-file-system")

        directory_client = file_system_client.get_directory_client("my-directory")

        file_client = directory_client.create_file("uploaded-file.txt")

        local_file = open("C:\\file-to-upload.txt", 'r')

        file_contents = local_file.read()

        file_client.append_data(data=file_contents, offset=0, length=len(file_contents))

        file_client.flush_data(len(file_contents))

    except Exception as e:
        print(e)

When I use the code to upload a small CSV file, it works fine: the CSV file is uploaded, and when I download it I can open it without any problems.

If I convert the same data frame to a small Parquet file and upload it, the upload works fine. But when I download the file and try to open it, I get this error message:

ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

If I read the Parquet file directly, without uploading, it works fine.
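For what it's worth, a valid Parquet file both starts and ends with the magic bytes `PAR1`, so the downloaded copy can be checked without PyArrow. A small diagnostic sketch (the function name and path are illustrative):

```python
def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes b'PAR1'."""
    with open(path, "rb") as f:
        header = f.read(4)
        f.seek(-4, 2)  # move to 4 bytes before the end of the file
        footer = f.read(4)
    return header == b"PAR1" and footer == b"PAR1"
```

If this returns True for the local file but False for the downloaded copy, the bytes were altered somewhere along the upload/download path.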

Does anyone have a suggestion how I need to modify the code so I don't destroy my parquet file?

Thanks!

2 Answers


I'm not sure what's wrong with your code (it looks incomplete), but you can try this code; it works on my side:

import numpy as np
import pandas as pd

try:
    file_system_client = service_client.get_file_system_client(file_system="my-file-system")

    directory_client = file_system_client.get_directory_client("my-directory")

    file_client = directory_client.create_file("data.parquet")

    # to_parquet() with no path returns the Parquet file content as bytes
    df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                       'two': ['foo', 'bar', 'baz'],
                       'three': [True, False, True]},
                      index=list('abc')).to_parquet()

    file_client.append_data(data=df, offset=0, length=len(df))

    file_client.flush_data(len(df))

except Exception as e:
    print(e)
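Note the code above uploads raw bytes rather than the result of a text-mode read. A likely culprit in the question's snippet is opening the file with mode `'r'`: Parquet is a binary format, and text-mode reading decodes (and may newline-translate) the bytes. A stdlib-only sketch of the difference, using made-up bytes that resemble a Parquet file:

```python
import os
import tempfile

# Bytes like those in a real Parquet file: binary, not valid UTF-8
data = b"PAR1\x00\xff\x80PAR1"

path = os.path.join(tempfile.mkdtemp(), "demo.parquet")
with open(path, "wb") as f:
    f.write(data)

# Binary mode ('rb') returns the exact bytes, which are safe to upload
with open(path, "rb") as f:
    round_trip = f.read()
assert round_trip == data

# Text mode ('r'), as in the question, tries to decode the bytes as text
# and typically fails outright, or silently mangles them
try:
    with open(path, "r") as f:
        f.read()
    text_mode_ok = True
except UnicodeDecodeError:
    text_mode_ok = False
```

So if you want to keep the original structure, opening the Parquet file with `'rb'` instead of `'r'` should be enough.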


I just resolved this error in my project today.

I am using pyarrow.parquet.write_table to write my Parquet file.

I was passing a native Python file object to the where parameter, which somehow caused the footer to never get written.

When I switched to using PyArrow output streams instead of native Python file objects, the footer got written correctly on stream close, which resolved this issue for me.

