I have two large datasets stored as hive-partitioned Parquet on S3 (partitioned by `category_id`). I need to join them on `category_id` and `label_id` using Polars and write the results to Postgres.
The problem:

- Data is too large for memory: calling `.collect()` on the joined dataframe is not feasible.
- Writing per partition is too slow: iterating over `category_id` and processing each partition takes too long (several seconds per partition).
My current approach:
```python
import polars as pl

df_left = pl.scan_parquet(
    "s3://my-bucket/left/**/*.parquet",
    hive_partitioning=True,
)
df_right = pl.scan_parquet(
    "s3://my-bucket/right/**/*.parquet",
    hive_partitioning=True,
)

df_joined = df_left.join(df_right, on=["category_id", "label_id"], how="inner")
```
At this point, I would like to efficiently:
- Process the data in a streaming fashion
- Write the joined data to Postgres
What I have tried:

- Looping over unique `category_id`s and processing them one by one. Too slow, since reading the Parquet partitions takes a few seconds each time.
- Using Polars' lazy execution (`scan_parquet`). I cannot `.collect()` because the joined result does not fit in memory.
- I am unsure how to efficiently stream the data to Postgres.
Question
How can I efficiently join two large Parquet datasets using Polars and write the result to Postgres in a way that avoids memory issues and excessive partition reads?
Would using Polars streaming help here? Or is there a way to batch-process partitions efficiently without reading each partition multiple times?