
I am storing two different pandas DataFrames as parquet files (through kedro).

Both DataFrames have identical dimensions and dtypes (float32) before getting written to disk. Also, their memory consumption in RAM is identical:

distances_1.memory_usage(deep=True).sum()/1e9
# 3.730033604
distances_2.memory_usage(deep=True).sum()/1e9
# 3.730033604

When persisted as .parquet files, the first DataFrame results in a file of ~0.89GB and the second in a file of ~4.5GB.

distances_1 has many more redundant values than distances_2 and thus compression might be more effective.
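A rough way to quantify that redundancy (a sketch, not a measurement I have actually run on the real data) is to compare the fraction of unique values in each frame:

# The fewer distinct values per column, the better dictionary and
# run-length encoding tend to compress the column.
unique_ratio_1 = distances_1.nunique().sum() / distances_1.size
unique_ratio_2 = distances_2.nunique().sum() / distances_2.size
print(unique_ratio_1, unique_ratio_2)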

Loading the parquet files from disk into DataFrames results in valid data that is identical to the original DataFrames.

  • How can the big size difference between the files be explained?
  • For what reasons could the second file be larger than the in-memory data structure?
Comments
  • Wouldn't it be less confusing to translate RAM into usual units? Commented Mar 16, 2021 at 9:08
    The code provided returns the total memory consumption of the dataframe in GB, right? I thought that would make it easy to compare it to the file sizes. Commented Mar 16, 2021 at 9:16
    I see, it's only that sometimes (like in Windows Explorer) the unit 1GB means 2^30 Bytes. Commented Mar 16, 2021 at 9:27
  • Is this "many more redundant values" measurable in some way? Commented Mar 16, 2021 at 9:30
  • Of course you are right about the GB, the division by 1e9 is just an approximation, but I don't think this is crucial to the issue, is it? Commented Mar 16, 2021 at 10:03
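For reference, the GB vs. GiB distinction raised above only shifts the figure by about 7%, so it cannot explain the size gap (a quick check, reusing the same memory_usage call as in the question):

total_bytes = distances_1.memory_usage(deep=True).sum()
print(total_bytes / 1e9)    # ~3.73 GB (decimal, as in the question)
print(total_bytes / 2**30)  # ~3.47 GiB (binary, as shown e.g. in Windows Explorer)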

2 Answers


As you say, the number of unique values can play a very important role in parquet file size.

When writing from pandas, two other factors that can have a surprisingly large effect on parquet file size are:

  1. pandas indexes, which are saved by default even if they're just auto-assigned;
  2. the sorting of your data, which can make a large difference in the run-length encoding parquet sometimes uses.

Shuffled, auto-assigned indices can take a lot of space. If you don't care about the sort order of the data on disk, paying attention to these two factors can make a significant difference.

Consider four cases of a pandas frame with one column containing the same data in all cases: the rounded-down square roots of the first 2**16 integers. Stored in sorted order without an index it takes 2.9K (3.3K if the default index is kept); shuffled without the auto-assigned index it takes 66K; auto-assigning an index and then shuffling takes 475K.

import pandas as pd
import numpy as np

!mkdir -p /tmp/parquet

# One column: floor(sqrt(n)) for n in 0..2**16-1, i.e. sorted data with long runs of repeats
d = pd.DataFrame({"A": np.floor(np.sqrt(np.arange(2**16)))})

d.to_parquet("/tmp/parquet/straight.parquet")                        # sorted, with index
d.to_parquet("/tmp/parquet/straight_no_index.parquet", index=False)  # sorted, no index
d.sample(frac=1).to_parquet("/tmp/parquet/shuf.parquet")             # shuffled, with index
d.sample(frac=1).to_parquet("/tmp/parquet/shuf_no_index.parquet", index=False)  # shuffled, no index

!ls -lSh /tmp/parquet
-rw-r--r--  1 user  wheel   475K Mar 18 13:39 shuf.parquet
-rw-r--r--  1 user  wheel    66K Mar 18 13:39 shuf_no_index.parquet
-rw-r--r--  1 user  wheel   3.3K Mar 18 13:39 straight.parquet
-rw-r--r--  1 user  wheel   2.9K Mar 18 13:39 straight_no_index.parquet

4 Comments

Great information, thanks. Writing without indexes makes a tiny difference in my case (much less than 0.1%) but doesn't explain the big difference in file sizes at all. I also need to preserve the order of rows (especially when dropping the indexes).
Do you have an idea why it might be larger on disk than in memory?
It's hard to think of any without knowing the shape of the data and the engine. As Wolf said, maybe it's something that Python is hashing internally. A couple of other possibilities:
(continuing...) Maybe some of the additional data that parquet can store beyond the pandas internals (e.g., the per-page index hints) is unexpectedly large. Maybe the encoding is actually losing space, which can happen; or you have table-level information that df.memory_usage(deep=True) isn't measuring. (If your column names, for example, are each a million characters long, they'll add a lot to the file size but not to df.memory_usage(), which doesn't consider them.)
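One way to dig into that (a sketch using PyArrow's parquet metadata API; the path is a placeholder for one of the written files) is to read the parquet footer and compare per-column compressed vs. uncompressed sizes and the encodings that were actually chosen:

import pyarrow.parquet as pq

# Inspect how parquet actually stored each column; the path is a placeholder.
meta = pq.ParquetFile("/path/to/distances_2.parquet").metadata
for rg in range(meta.num_row_groups):
    for c in range(meta.num_columns):
        col = meta.row_group(rg).column(c)
        print(col.path_in_schema, col.compression, col.encodings,
              col.total_compressed_size, col.total_uncompressed_size)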

From a Kedro point of view this is just calling the PyArrow library's write_table function, documented here. Any of these parameters are available via the save_args argument in the catalog definition and may be worth experimenting with.
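As a sketch of what such tuning could look like (the parameter values are only examples, not a recommendation, and df stands for the frame being saved), the same options can also be passed through pandas' to_parquet, which forwards extra keyword arguments to pyarrow.parquet.write_table; in a Kedro catalog entry they would sit under save_args:

# Example values only; in a Kedro catalog these would be listed under
# save_args for the dataset. Extra keyword arguments are forwarded to
# pyarrow.parquet.write_table.
df.to_parquet(
    "distances.parquet",
    engine="pyarrow",
    compression="zstd",  # try a different codec than the default snappy
    index=False,         # drop the auto-assigned pandas index
)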

