
Is there a way to reference another Polars DataFrame in Polars expressions without using lambdas?

Just to use a simple example - suppose I have two dataframes:

from datetime import date

import polars as pl

df_1 = pl.DataFrame(
    {
        "time": pl.date_range(
            pl.date(2021, 1, 1),
            pl.date(2022, 1, 1),
            interval="1d",
            eager=True
        ),
        "x": pl.int_range(0, 366, eager=True),
    }
)

df_2 = pl.DataFrame(
    {
        "time": pl.date_range(
            pl.date(2021, 1, 1),
            pl.date(2021, 2, 1),
            interval="1mo",
            eager=True
        ),
        "y": [50, 100],
    }
)

For each y value in df_2, I would like to find the maximum date in df_1, restricted to rows where x is lower than that y.

I am able to do this using map_elements and a lambda (see below), but I am wondering whether there is a more idiomatic way to perform this operation?

df_2.group_by("y").agg(
    pl.col("y")
    .map_elements(
        lambda s: df_1.filter(pl.col("x") < s).select(pl.col("time")).max().item()
    )
    .alias("latest")
)
shape: (2, 2)
┌─────┬────────────┐
│ y   ┆ latest     │
│ --- ┆ ---        │
│ i64 ┆ date       │
╞═════╪════════════╡
│ 50  ┆ 2021-02-19 │
│ 100 ┆ 2021-04-10 │
└─────┴────────────┘

Edit:

Is it possible to pre-filter df_1 prior to using join_asof? Switching the question to look for the min instead of the max, this is what I would do for an individual case:

(
    df_2
    .filter(pl.col("y") == 50)
    .join_asof(
        df_1
        .sort("x")
        .filter(pl.col("time") > date(2021, 11, 1))
        .select(
            pl.col("time").cum_min().alias("time_min"),
            pl.col("x").alias("original_x"),
            (pl.col("x") + 1).alias("x"),
        ),
        left_on="y",
        right_on="x",
        strategy="forward",
    )
)

Is there a way to generalise this merge without using a loop / map_elements function?

1 Answer


Edit: Generalizing a join

One somewhat-dangerous approach to generalizing a join (so that you can run any sub-queries and filters that you like) is to use a "cross" join.

I say "somewhat-dangerous" because the number of row combinations considered in a cross join is M × N, where M and N are the number of rows in your two DataFrames. So if your two DataFrames are 1 million rows each, that is 1 million × 1 million = 1 trillion row combinations being considered. This can exhaust your RAM or simply take a long time.

If you'd like to try it, here's how it would work (along with some arbitrary filters that I constructed, just to show the ultimate flexibility that a cross-join creates).

(
    df_2.lazy()
    .join(
        df_1.lazy(),
        how="cross"
    )
    .filter(pl.col('time_right') >= pl.col('time'))
    .group_by('y')
    .agg(
        pl.col('time').first(),

        pl.col('x')
        .filter(pl.col('y') > pl.col('x'))
        .max()
        .alias('max(x) for(y>x)'),

        pl.col('time_right')
        .filter(pl.col('y') > pl.col('x'))
        .max()
        .alias('max(time_right) for(y>x)'),

        pl.col('time_right')
        .filter(
            pl.col('y') <= pl.col('x'),
            pl.col('time_right') > pl.col('time')
        )
        .min()
        .alias('min(time_right) for(two filters)'),
    )
    .collect()
)
shape: (2, 5)
┌─────┬────────────┬─────────────────┬──────────────────────────┬──────────────────────────────────┐
│ y   ┆ time       ┆ max(x) for(y>x) ┆ max(time_right) for(y>x) ┆ min(time_right) for(two filters) │
│ --- ┆ ---        ┆ ---             ┆ ---                      ┆ ---                              │
│ i64 ┆ date       ┆ i64             ┆ date                     ┆ date                             │
╞═════╪════════════╪═════════════════╪══════════════════════════╪══════════════════════════════════╡
│ 100 ┆ 2021-02-01 ┆ 99              ┆ 2021-04-10               ┆ 2021-04-11                       │
│ 50  ┆ 2021-01-01 ┆ 49              ┆ 2021-02-19               ┆ 2021-02-20                       │
└─────┴────────────┴─────────────────┴──────────────────────────┴──────────────────────────────────┘

A couple of suggestions:

  • I strongly recommend running the cross-join in Lazy mode.
  • Try to filter directly after the cross-join, to eliminate row combinations that you will never need. This reduces the burden on the later group_by step.

Given the explosive potential of row combinations for cross-joins, I tried to steer you toward a join_asof (which did solve the original sample question). But if you need the flexibility beyond what a join_asof can provide, the cross-join will provide ultimate flexibility -- at a cost.
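
If, as in your edit, you also want to pre-filter df_1, that filter can simply be applied to df_1 before the cross-join, which trims the row combinations up front. Here is a minimal sketch; the cutoff date and the min aggregation are arbitrary assumptions for illustration, not part of the original question:

(
    df_2.lazy()
    .join(
        # pre-filter df_1 before the cross-join so fewer combinations are built
        df_1.lazy().filter(pl.col("time") > date(2021, 11, 1)),
        how="cross",
    )
    # discard unwanted combinations as early as possible
    .filter(pl.col("y") > pl.col("x"))
    .group_by("y")
    .agg(pl.col("time_right").min().alias("min(time_right)"))
    .collect()
)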

join_asof

We can use a join_asof to accomplish this, with two wrinkles.

The Algorithm

(
    df_2
    .sort("y")
    .join_asof(
        (
            df_1
            .sort("x")
            .select(
                pl.col("time").cum_max().alias("time_max"),
                (pl.col("x") + 1),
            )
        ),
        left_on="y",
        right_on="x",
        strategy="backward",
    )
    .drop('x')
)
shape: (2, 3)
┌────────────┬─────┬────────────┐
│ time       ┆ y   ┆ time_max   │
│ ---        ┆ --- ┆ ---        │
│ date       ┆ i64 ┆ date       │
╞════════════╪═════╪════════════╡
│ 2021-01-01 ┆ 50  ┆ 2021-02-19 │
│ 2021-02-01 ┆ 100 ┆ 2021-04-10 │
└────────────┴─────┴────────────┘

This matches the output of your code.

In steps

Let's add some extra information to our query, to elucidate how it works.

(
    df_2
    .sort("y")
    .join_asof(
        (
            df_1
            .sort("x")
            .select(
                pl.col("time").cum_max().alias("time_max"),
                pl.col("x").alias("original_x"),
                (pl.col("x") + 1).alias("x"),
            )
        ),
        left_on="y",
        right_on="x",
        strategy="backward",
    )
)
shape: (2, 5)
┌────────────┬─────┬────────────┬────────────┬─────┐
│ time       ┆ y   ┆ time_max   ┆ original_x ┆ x   │
│ ---        ┆ --- ┆ ---        ┆ ---        ┆ --- │
│ date       ┆ i64 ┆ date       ┆ i64        ┆ i64 │
╞════════════╪═════╪════════════╪════════════╪═════╡
│ 2021-01-01 ┆ 50  ┆ 2021-02-19 ┆ 49         ┆ 50  │
│ 2021-02-01 ┆ 100 ┆ 2021-04-10 ┆ 99         ┆ 100 │
└────────────┴─────┴────────────┴────────────┴─────┘

Getting the maximum date

Instead of attempting a "non-equi" join or sub-queries to obtain the maximum date for x or any lesser value of x, we can use a simpler approach: sort df_1 by x and calculate the cumulative maximum date for each x. That way, when we join, we can join to a single row in df_1 and be certain that for any x, we are getting the maximum date for that x and all lesser values of x. The cumulative maximum is displayed above as time_max.
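
To see what the cumulative maximum buys us, here is a tiny toy frame (illustrative data of my own, not the question's) in which time does not increase with x:

toy = pl.DataFrame(
    {
        "x": [1, 2, 3],
        "time": [date(2021, 3, 1), date(2021, 1, 1), date(2021, 2, 1)],
    }
)
toy.sort("x").with_columns(pl.col("time").cum_max().alias("time_max"))
# time_max is 2021-03-01 on every row: the maximum over that x and all lesser x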

less-than (and not less-than-or-equal-to)

From the documentation for join_asof:

A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.

Since you want "less than" and not "less than or equal to", we can simply increase each value of x by 1. Since x and y are integers, this will work. The result above displays both the original value of x (original_x), and the adjusted value (x) used in the join_asof.

If x and y are floats, you can instead add an arbitrarily small amount to x (e.g., x + 0.000000001) to force the strict inequality.
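
For example, the right-hand side of the join_asof could be built like this; the 1e-9 epsilon is an assumption and just needs to be smaller than any real gap between your x and y values:

df_1.sort("x").select(
    pl.col("time").cum_max().alias("time_max"),
    # nudge x upward so the "backward" search behaves like a strict less-than
    (pl.col("x") + 1e-9).alias("x"),
)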


Comments

I should mention: the above algorithm assumes that the values of x in df_1 are unique. If not, then we have to take an additional step before the cum_max step: calculate the maximum value of time for each x, then run cum_max.
I should also add: I'm assuming there are potential gaps in the x values, so that there may not be an exact match for (y - 1) in df_1. If you're guaranteed that all integer values of x are present, the join_asof can be reduced to a simple join.
Thanks. Is it possible to add a second condition that the time in df_1 must be greater than the time in df_2? I.e. if df_2 = pl.DataFrame({"time": pl.date_range(low=date(2021, 11, 1), high=date(2021, 12, 1), interval="1mo"), "y": [50, 100]}), then performing the operation above would give time_max < time. The only way I can think of resolving this is a for loop, so I would be interested in whether there is an alternative.
Not sure I understand. In your example in the comment above, for y == 50, the maximum value of time in df_1 (for x < y) is 2021-02-19, which is less than time in df_2 (2021-11-01). So what should happen then? What value are you looking for in that case?
Sorry, I just realised I have been unclear; I will edit the post to clarify. Thanks.
