At the same zstd compression level (10) with Parquet, I get significantly smaller files from Pandas than from Apache Spark. The files below, for example, were first generated with Spark and then loaded and re-saved from a Python shell. Why is there such a large discrepancy in compressed size? I am using a native filesystem (ext4) on Ubuntu.
In Spark (Scala):
df
.coalesce(1)
.write
.option("compression", "zstd")
.option("compressionLevel", "10")
.mode("overwrite")
.parquet(parquetPath)
In Python:
>>> import pandas as pd
>>> df = pd.read_parquet('results/data1.parquet')
>>> df.to_parquet('data1.parquet', engine='pyarrow', compression="zstd", compression_level=10, index=False)
>>> df = pd.read_parquet('results/data2.parquet')
>>> df.to_parquet('data2.parquet', engine='pyarrow', compression="zstd", compression_level=10, index=False)
Stats:
file                  | Apache Spark (bytes) | Pandas (bytes) | Pandas/Spark size ratio
---------------------------------------------------------------------------------------
results/data1.parquet |            237780532 |      172442433 | 0.72
results/data2.parquet |             62052301 |       41917063 | 0.67
Software:
Apache Spark-4.0.0-preview1
scala-2.13.14
Java 21.0.4
python-3.12.4
pandas-2.2.3
pyarrow-19.0.1
PS: The file metadata is as follows:
pqt$ ls -l
-r--r--r-- 1 user user 237780532 May 20 11:55 spark.parquet
pqt$ python3
Python 3.12.4 | packaged by Anaconda, Inc. | (main, Jun 18 2024, 15:12:24) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.parquet as pq
>>> import pandas as pd
>>> df = pd.read_parquet('spark.parquet')
>>> df.to_parquet('pandas.parquet', engine='pyarrow', compression="zstd", compression_level=10, index=False)
>>> parquet_file = pq.ParquetFile('spark.parquet')
>>> print(parquet_file.metadata)
<pyarrow._parquet.FileMetaData object at 0x11e8091de2a0>
created_by: parquet-mr version 1.13.1 (build db4183109d5b734ec5930d870cdae161e408ddba)
num_columns: 3
num_rows: 20000
num_row_groups: 2
format_version: 1.0
serialized_size: 1001
>>> parquet_file = pq.ParquetFile('pandas.parquet')
>>> print(parquet_file.metadata)
<pyarrow._parquet.FileMetaData object at 0x11e808d57fb0>
created_by: parquet-cpp-arrow version 19.0.1
num_columns: 3
num_rows: 20000
num_row_groups: 1
format_version: 2.6
serialized_size: 1905
>>>
pqt$ ls -l
-rw-rw-r-- 1 user user 172442433 May 20 12:01 pandas.parquet
-r--r--r-- 1 user user 237780532 May 20 11:55 spark.parquet
pqt$ bc
scale = 2
172442433 / 237780532
.72
pqt$
Use parquet-tools meta <file> (renamed parquet-cli in later versions) to review and compare the file structure, including column encodings and sizes: github.com/apache/parquet-java/tree/parquet-1.10.x/parquet-cli
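As a pure-Python alternative, pyarrow exposes the same per-row-group, per-column details through its metadata API; a sketch, assuming the two file names from the PS above:

import pyarrow.parquet as pq

# Compare where the bytes go: per row group and per column,
# compressed size and the encodings actually used.
for path in ('spark.parquet', 'pandas.parquet'):
    meta = pq.ParquetFile(path).metadata
    print(path)
    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for col in range(meta.num_columns):
            chunk = row_group.column(col)
            print(f'  row group {rg}, {chunk.path_in_schema}: '
                  f'{chunk.total_compressed_size} bytes compressed, '
                  f'encodings={chunk.encodings}')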