
I am writing Parquet files using two different frameworks—Apache Spark (Scala) and Polars (Python)—with the same schema and data. However, when I query the resulting Parquet files using Apache DataFusion, I notice a significant performance difference:

- Queries run faster on the Parquet file written by Polars
- Queries take longer on the Parquet file written by Spark

I expected similar performance since the schema and data are the same in both cases, and I am trying to understand why this discrepancy occurs.

Here are some details about my setup:

Spark version: 3.5.0
Polars version: 1.24.0

Parquet write options:

Spark: df.write.parquet("path")
Polars: df.write_parquet("path")

I tried changing the compression codec for Spark as well, but was not able to match the performance of the Parquet file written by Polars.

Has anyone experienced a similar issue? What aspects of Spark's and Polars' Parquet writing might cause this performance difference? Are there specific configurations I should check when writing Parquet in either framework?

These are some configs I also tried adjusting for Spark before writing:

 .config("spark.sql.parquet.compression.codec", "zstd")  
  .config("parquet.enable.dictionary", "true")  
  .config("parquet.dictionary.pageSize", 1048576) 
  .config("parquet.block.size", 4 * 1024 * 1024)  // Smaller row groups (4MB) for DataFusion 
  .config("parquet.page.size", 128 * 1024)  
  .config("parquet.writer.version", "PARQUET_2_0")  
  .config("parquet.int96RebaseModeInWrite", "CORRECTED")  
  .config("spark.sql.parquet.mergeSchema", "false")  
  .config("parquet.column.index.enabled", "true")  
  .config("parquet.column.index.pageSize", 64 * 1024)
  .config("parquet.statistics.enabled", "true")  
  .config("parquet.int64.timestats", "false")  
  .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")  
  .config("spark.sql.parquet.filterPushdown", "true") 
Comments:

  • Have you tried querying the metadata and statistics of the resulting files? (Commented Mar 14 at 23:01)
  • It's probably different row group sizes. Too many row groups and there's too much overhead. In big enough files, too few row groups means the reader can't parallelize where it might otherwise be able to. (Commented Mar 15 at 11:31)
  • 1. Have you checked the file count and file sizes in each case? 2. Are you using the same partitioning for both? (Commented Mar 15 at 11:39)
