I am storing two different pandas DataFrames as parquet files (through kedro).
Both DataFrames have identical dimensions and dtypes (float32) before being written to disk, and their memory consumption in RAM is identical:
distances_1.memory_usage(deep=True).sum()/1e9
# 3.730033604
distances_2.memory_usage(deep=True).sum()/1e9
# 3.730033604
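For completeness, a minimal sanity check of the claim that shapes and dtypes match, using the same distances_1/distances_2 frames as above:

```python
# Quick check that both frames agree in shape and are pure float32
assert distances_1.shape == distances_2.shape
assert (distances_1.dtypes == "float32").all()
assert (distances_2.dtypes == "float32").all()
```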
When persisted as .parquet files, the first DataFrame results in a file of ~0.89 GB, while the second results in a file of ~4.5 GB.
distances_1 contains many more redundant values than distances_2, so compression might be more effective for it.
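One way to check this hypothesis is to compare the compressed and uncompressed column-chunk sizes recorded in each file's footer metadata. A minimal sketch with pyarrow (the file paths are placeholders for wherever kedro writes the datasets):

```python
import pyarrow.parquet as pq

# Placeholder paths; substitute the actual files written by kedro.
for path in ["distances_1.parquet", "distances_2.parquet"]:
    meta = pq.ParquetFile(path).metadata
    compressed = sum(
        meta.row_group(rg).column(col).total_compressed_size
        for rg in range(meta.num_row_groups)
        for col in range(meta.num_columns)
    )
    uncompressed = sum(
        meta.row_group(rg).column(col).total_uncompressed_size
        for rg in range(meta.num_row_groups)
        for col in range(meta.num_columns)
    )
    codec = meta.row_group(0).column(0).compression
    print(f"{path}: codec={codec}, "
          f"compressed={compressed / 2**30:.2f} GiB, "
          f"uncompressed={uncompressed / 2**30:.2f} GiB")
```

A much smaller compressed-to-uncompressed ratio for distances_1 would support the redundancy explanation.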
Loading the parquet files from disk into DataFrames results in valid data that is identical to the original DataFrames.
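One way to confirm the round trip, e.g. with pandas and its testing helpers (paths are placeholders):

```python
import pandas as pd

# Reload one file and compare it element-wise against the original frame.
reloaded_1 = pd.read_parquet("distances_1.parquet")
pd.testing.assert_frame_equal(reloaded_1, distances_1)
```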
- How can the big size difference between the files be explained?
- For what reasons could the second file be larger than the in-memory data structure?
1 GB here means 2^30 bytes.