
I am writing Parquet files using two different frameworks—Apache Spark (Scala) and Polars (Python)—with the same schema and data. However, when I query the resulting Parquet files using Apache DataFusion, I notice a significant performance difference:

- Queries run faster on the Parquet file written by Polars
- Queries take longer on the Parquet file written by Spark

I expected similar performance since the schema and data are the same in both cases, and I am trying to understand why this discrepancy occurs.

Here are some details about my setup:

Spark version: 3.5.0
Polars version: 1.24.0

Parquet write options:

Spark: df.write.parquet("path")
Polars: df.write_parquet("path")

I tried changing the compression codec for Spark as well, but was not able to match the performance of the Parquet file written by Polars.

Has anyone experienced a similar issue? What aspects of Spark's and Polars' Parquet writing might cause this performance difference? Are there specific configurations I should check when writing Parquet in either framework?

These are some configs I also tried adjusting for Spark before writing:

 .config("spark.sql.parquet.compression.codec", "zstd")  
  .config("parquet.enable.dictionary", "true")  
  .config("parquet.dictionary.pageSize", 1048576) 
  .config("parquet.block.size", 4 * 1024 * 1024)  // Smaller row groups (4MB) for DataFusion 
  .config("parquet.page.size", 128 * 1024)  
  .config("parquet.writer.version", "PARQUET_2_0")  
  .config("parquet.int96RebaseModeInWrite", "CORRECTED")  
  .config("spark.sql.parquet.mergeSchema", "false")  
  .config("parquet.column.index.enabled", "true")  
  .config("parquet.column.index.pageSize", 64 * 1024)
  .config("parquet.statistics.enabled", "true")  
  .config("parquet.int64.timestats", "false")  
  .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")  
  .config("spark.sql.parquet.filterPushdown", "true") 
Comments:

  • Have you tried querying the metadata and statistics of the resulting files? (Commented Mar 14 at 23:01)
  • It's probably different row group sizes. Too many row groups and there's too much overhead. In big enough files, too few row groups means the reader can't parallelize where it might otherwise be able to. (Commented Mar 15 at 11:31)
  • 1. Have you checked the file count and file sizes in each case? 2. Are you using the same partitioning for both? (Commented Mar 15 at 11:39)
