
I am storing two different pandas DataFrames as parquet files (through kedro).

Both DataFrames have identical dimensions and dtypes (float32) before getting written to disk. Also, their memory consumption in RAM is identical:

distances_1.memory_usage(deep=True).sum()/1e9
# 3.730033604
distances_2.memory_usage(deep=True).sum()/1e9
# 3.730033604

When persisted as .parquet files, the first DataFrame results in a file of ~0.89GB and the second in a file of ~4.5GB.

distances_1 has many more redundant values than distances_2 and thus compression might be more effective.
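A rough way to quantify that redundancy (a sketch, not a measurement I have actually run on the real data) is to compare the fraction of unique values in each frame:

# The fewer distinct values per column, the better dictionary and
# run-length encoding tend to compress the column.
unique_ratio_1 = distances_1.nunique().sum() / distances_1.size
unique_ratio_2 = distances_2.nunique().sum() / distances_2.size
print(unique_ratio_1, unique_ratio_2)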

Loading the parquet files from disk into DataFrames results in valid data that is identical to the original DataFrames.

  • How can the big size difference between the files be explained?
  • For what reasons could the second file be larger than the in-memory data structure?
Comments
  • Wouldn't it be less confusing to translate RAM into usual units? Commented Mar 16, 2021 at 9:08
    The code provided returns the total memory consumption of the dataframe in GB, right? I thought that would make it easy to compare it to the file sizes. Commented Mar 16, 2021 at 9:16
    I see, it's only that sometimes (like in Windows Explorer) the unit 1GB means 2^30 Bytes. Commented Mar 16, 2021 at 9:27
  • Is this "many more redundant values" measurable in some way? Commented Mar 16, 2021 at 9:30
  • Of course you are right about the GB, the division by 1e9 is just an approximation, but I don't think this is crucial to the issue, is it? Commented Mar 16, 2021 at 10:03
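For reference, the GB vs. GiB distinction raised above only shifts the figure by about 7%, so it cannot explain the size gap (a quick check, reusing the same memory_usage call as in the question):

total_bytes = distances_1.memory_usage(deep=True).sum()
print(total_bytes / 1e9)    # ~3.73 GB (decimal, as in the question)
print(total_bytes / 2**30)  # ~3.47 GiB (binary, as shown e.g. in Windows Explorer)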

2 Answers


As you say, the number of unique values can play a very important role in parquet file size.

When writing from pandas, two other factors that can have a surprisingly large effect on parquet file size are:

  1. pandas indexes, which are saved by default even if they're just auto-assigned;
  2. the sorting of your data, which can make a large difference in the run-length encoding parquet sometimes uses.

Shuffled, auto-assigned indices can take a lot of space. If you don't care about the sort order of the data on disk, paying attention to these two factors can make a significant difference.

Consider four cases of a pandas frame with one column containing the same data in all cases: the rounded-down square roots of the first 2**16 integers. Stored in sorted order without an index it takes 2.9K (3.3K if the default index is kept); shuffled without the auto-assigned index it takes 66K; auto-assigning an index and then shuffling takes 475K.

import pandas as pd
import numpy as np

!mkdir -p /tmp/parquet

# One column: floor(sqrt(n)) for n in 0..2**16-1, i.e. sorted data with long runs of repeats
d = pd.DataFrame({"A": np.floor(np.sqrt(np.arange(2**16)))})

d.to_parquet("/tmp/parquet/straight.parquet")                        # sorted, with index
d.to_parquet("/tmp/parquet/straight_no_index.parquet", index=False)  # sorted, no index
d.sample(frac=1).to_parquet("/tmp/parquet/shuf.parquet")             # shuffled, with index
d.sample(frac=1).to_parquet("/tmp/parquet/shuf_no_index.parquet", index=False)  # shuffled, no index

!ls -lSh /tmp/parquet
-rw-r--r--  1 user  wheel   475K Mar 18 13:39 shuf.parquet
-rw-r--r--  1 user  wheel    66K Mar 18 13:39 shuf_no_index.parquet
-rw-r--r--  1 user  wheel   3.3K Mar 18 13:39 straight.parquet
-rw-r--r--  1 user  wheel   2.9K Mar 18 13:39 straight_no_index.parquet

4 Comments

Great information, thanks. Writing without indexes makes a tiny difference in my case (much less than 0.1%) but doesn't explain the big difference in file sizes at all. I also need to preserve the order of rows (especially when dropping the indexes).
Do you have an idea why it might be larger on disk than in memory?
It's hard to think of any without knowing the shape of the data and the engine. As Wolf said, maybe it's something that Python is hashing internally. A couple of other possibilities:
(continuing...) Maybe some of the additional data that parquet can store beyond the pandas internals (e.g., the per-page index hints) is unexpectedly large. Maybe the encoding is actually losing space, which can happen; or you have table-level information that df.memory_usage(deep=True) isn't measuring. (If your column names, for example, are each a million characters long, they'll add a lot to the file size but not to df.memory_usage(), which doesn't consider them.)
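One way to dig into that (a sketch using PyArrow's parquet metadata API; the path is a placeholder for one of the written files) is to read the parquet footer and compare per-column compressed vs. uncompressed sizes and the encodings that were actually chosen:

import pyarrow.parquet as pq

# Inspect how parquet actually stored each column; the path is a placeholder.
meta = pq.ParquetFile("/path/to/distances_2.parquet").metadata
for rg in range(meta.num_row_groups):
    for c in range(meta.num_columns):
        col = meta.row_group(rg).column(c)
        print(col.path_in_schema, col.compression, col.encodings,
              col.total_compressed_size, col.total_uncompressed_size)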

From a Kedro point of view this is just calling the PyArrow library's write_table function, documented here. Any of these parameters are available via the save_args argument in the catalog definition and may be worth experimenting with.
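As a sketch of what such tuning could look like (the parameter values are only examples, not a recommendation, and df stands for the frame being saved), the same options can also be passed through pandas' to_parquet, which forwards extra keyword arguments to pyarrow.parquet.write_table; in a Kedro catalog entry they would sit under save_args:

# Example values only; in a Kedro catalog these would be listed under
# save_args for the dataset. Extra keyword arguments are forwarded to
# pyarrow.parquet.write_table.
df.to_parquet(
    "distances.parquet",
    engine="pyarrow",
    compression="zstd",  # try a different codec than the default snappy
    index=False,         # drop the auto-assigned pandas index
)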

