I am trying to write the result of a Snowflake query to disk and then query that data using Arrow and DuckDB. I created a partitioned Parquet dataset with the query below, following this:

COPY INTO 's3://path/to/folder/'
FROM (
    SELECT transaction.TRANSACTION_ID, OUTPUT_SCORE, MODEL_NAME, ACCOUNT_ID, to_char(TRANSACTION_DATE,'YYYY-MM') as SCORE_MTH
    FROM transaction
    )
PARTITION BY ('SCORE_MTH=' || score_mth || '/ACCOUNT_ID=' || ACCOUNT_ID)
FILE_FORMAT = (TYPE = PARQUET)
HEADER = TRUE

When I try to read the parquet files I get the following error:

df = pd.read_parquet('path/to/parquet/') # same result using pq.ParquetDataset or pq.read_table as they all use the same function under the hood

ArrowInvalid: Unable to merge: Field SCORE_MTH has incompatible types: string vs dictionary<values=string, indices=int32, ordered=0>

Moreover, after some googling I found this page. Following the instructions there: df = pd.read_parquet('path/to/parquet/', use_legacy_dataset=True)

ValueError: Schema in partition[SCORE_MTH=0, ACCOUNT_ID=0] /path/to/parquet was different. 
TRANSACTION_ID: string not null
OUTPUT_SCORE: double
MODEL_NAME: string
ACCOUNT_ID: int32
SCORE_MTH: string

vs

TRANSACTION_ID: string not null
OUTPUT_SCORE: double
MODEL_NAME: string

Depending on the data type, you may also get one of these errors:

ArrowInvalid: Unable to merge: Field X has incompatible types: IntegerType vs DoubleType

or

ArrowInvalid: Unable to merge: Field X has incompatible types: decimal vs int32

This is a known issue.

Any idea how I can read this parquet file?

3 Answers


I was just dealing with the same issue, and for me it worked when I provided the pyarrow schema to the function:

import pandas as pd
import pyarrow as pa

schema = pa.schema([('SCORE_MTH', pa.string()), ('ACCOUNT_ID', pa.int32())])
pd.read_parquet('s3://path/to/folder/', schema=schema)  # also works with filters

1 Comment

Yeah, that works. However, you need to know the data types in the first place, and it's a bit unwieldy if you have a lot of columns.

My workaround was to use fastparquet instead of pyarrow to read it. Just pip install fastparquet and then:

df = pd.read_parquet('path/to/parquet/', engine="fastparquet")



The only workaround I found that works is this:

import pyarrow.dataset as ds
dataset = ds.dataset('/path/to/parquet/', format="parquet", partitioning="hive")

Then you can query it directly using DuckDB:

import duckdb
con = duckdb.connect()
pandas_df = con.execute("SELECT * FROM dataset").df()

Also, if you want a pandas DataFrame directly, you can do this:

dataset.to_table().to_pandas()

Note that to_table() will load the whole dataset into memory.

