I am trying to write the result of a Snowflake query to disk and then query that data using Arrow and DuckDB. I created a partitioned Parquet dataset with the query below, following this:

COPY INTO 's3://path/to/folder/'
FROM (
    SELECT transaction.TRANSACTION_ID, OUTPUT_SCORE, MODEL_NAME, ACCOUNT_ID, to_char(TRANSACTION_DATE,'YYYY-MM') as SCORE_MTH
    FROM transaction
    )
PARTITION BY ('SCORE_MTH=' || score_mth || '/ACCOUNT_ID=' || ACCOUNT_ID)
FILE_FORMAT = (TYPE = PARQUET)
HEADER = TRUE

When I try to read the parquet files I get the following error:

df = pd.read_parquet('path/to/parquet/') # same result using pq.ParquetDataset or pq.read_table as they all use the same function under the hood

ArrowInvalid: Unable to merge: Field SCORE_MTH has incompatible types: string vs dictionary<values=string, indices=int32, ordered=0>

Moreover, after some googling I found this page. Following the instructions there: df = pd.read_parquet('path/to/parquet/', use_legacy_dataset=True)

ValueError: Schema in partition[SCORE_MTH=0, ACCOUNT_ID=0] /path/to/parquet was different. 
TRANSACTION_ID: string not null
OUTPUT_SCORE: double
MODEL_NAME: string
ACCOUNT_ID: int32
SCORE_MTH: string

vs

TRANSACTION_ID: string not null
OUTPUT_SCORE: double
MODEL_NAME: string

Depending on the data type, you may also get one of these errors:

ArrowInvalid: Unable to merge: Field X has incompatible types: IntegerType vs DoubleType

or

ArrowInvalid: Unable to merge: Field X has incompatible types: decimal vs int32

This is a known issue.

Any idea how I can read this parquet file?

3 Answers


I was just dealing with the same issue, and for me it worked when I provided the pyarrow schema to the function:

import pandas as pd
import pyarrow as pa

schema = pa.schema([('SCORE_MTH', pa.string()), ('ACCOUNT_ID', pa.int32())])
pd.read_parquet('s3://path/to/folder/', schema=schema)  # also works with filters

1 Comment

Yeah, that works. However, you need to know the data types in the first place, and it's a bit unwieldy if you have a lot of columns.

My workaround was to use fastparquet instead of pyarrow to read it. Just pip install fastparquet and then:

df = pd.read_parquet('path/to/parquet/', engine="fastparquet")



The only workaround I found that works is this:

import pyarrow.dataset as ds
dataset = ds.dataset('/path/to/parquet/', format="parquet", partitioning="hive")

Then you can query it directly using DuckDB:

import duckdb
con = duckdb.connect()
pandas_df = con.execute("SELECT * FROM dataset").df()

Also, if you want a pandas DataFrame directly, you can do this:

dataset.to_table().to_pandas()

Note that to_table() will load the whole dataset into memory.

