1

I've noticed some unexpected behavior with the interpolate_by expression and I'm not sure what is going on.

import polars as pl

df = pl.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [4, 5, None, 7, 8]
})
df = df.with_columns(interpolate = pl.col('b').interpolate_by('a'))
print(df)

results in this:

┌─────┬──────┬─────────────┐
│ a   ┆ b    ┆ interpolate │
│ --- ┆ ---  ┆ ---         │
│ i64 ┆ i64  ┆ f64         │
╞═════╪══════╪═════════════╡
│ 1   ┆ 4    ┆ 4.0         │
│ 2   ┆ 5    ┆ 5.0         │
│ 3   ┆ null ┆ 6.0         │
│ 4   ┆ 7    ┆ 7.0         │
│ 5   ┆ 8    ┆ 8.0         │
└─────┴──────┴─────────────┘

which is correct. However this:

df = pl.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [4, 5, 6, 7, None]
})
df = df.with_columns(interpolate = pl.col('b').interpolate_by('a'))
print(df)

results in this:

shape: (5, 3)
┌─────┬──────┬─────────────┐
│ a   ┆ b    ┆ interpolate │
│ --- ┆ ---  ┆ ---         │
│ i64 ┆ i64  ┆ f64         │
╞═════╪══════╪═════════════╡
│ 1   ┆ 4    ┆ 4.0         │
│ 2   ┆ 5    ┆ 5.0         │
│ 3   ┆ 6    ┆ 6.0         │
│ 4   ┆ 7    ┆ 7.0         │
│ 5   ┆ null ┆ null        │
└─────┴──────┴─────────────┘

which is not correct. There is still plenty of data to perform a linear interpolation on column B using the data in column A. Am I missing something here and don't understand how this is supposed to work or is this a bug?

2
  • if you think this command should work in different way them maybe send this problem to authors of polars as issue. Maybe it is bug. Commented Feb 23 at 20:30
  • It can only interpolate missing values in between known values, not at the beginning or end. Commented Feb 24 at 1:06

1 Answer 1

0

Just a quick pedantic point, polars isn't failing to interpolate. You're wanting it to extrapolate.

If you want it to take the last trend and apply it then there's not a built in way to do that but you can do it yourself with expressions with forward_fill and shift.

df.with_columns(
    pl.col("b").fill_null(
        ((pl.col("b")-pl.col("b").shift())/(pl.col("a")-pl.col("a").shift()))
        .forward_fill() *
        (pl.col("a")-pl.col("a").shift()) + pl.col("b").shift()
        )
    )

This extrapolation is interchangeable with your first example's interpolate since there's a constant trend but it'd give you a much different answer if the trend was differing over time (see note at the end).

You can turn this into a function for reuse with

def extrapolate_by(col_to_fill:str|pl.Expr, index_col:str|pl.Expr) -> pl.Expr:
    if isinstance(col_to_fill, str):
        col_to_fill=pl.col(col_to_fill)
    if isinstance(index_col, str):
        index_col=pl.col(index_col)
    return col_to_fill.fill_null(
        ((col_to_fill-col_to_fill.shift())/(index_col-index_col.shift()))
        .forward_fill() *
        ((index_col-index_col.shift())) + col_to_fill.shift()
    )

and then you can simple do df.with_columns(extrapolate_by("b","a")) or if you don't mind not having type hints, you can monkey patch it to pl.Expr and its usage will be chainable

pl.Expr.extrapolate_by = extrapolate_by
df.with_columns(pl.col("b").extrapolate_by("a"))

As is, this only works forward, if you want to fill missing values at the beginning, you'd have to inverse the shift direction and fill direction.

Note on difference between extrapolate and interpolate.

df = pl.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [4, 6, None, 20, 22]
})

df.with_columns(
    interpolate=pl.col("b").interpolate_by("a"),
    extrapolate=pl.col("b").extrapolate_by("a")
    )

shape: (5, 4)
┌─────┬──────┬─────────────┬─────────────┐
│ a   ┆ b    ┆ interpolate ┆ extrapolate │
│ --- ┆ ---  ┆ ---         ┆ ---         │
│ i64 ┆ i64  ┆ f64         ┆ f64         │
╞═════╪══════╪═════════════╪═════════════╡
│ 1   ┆ 4    ┆ 4.0         ┆ 4.0         │
│ 2   ┆ 6    ┆ 6.0         ┆ 6.0         │
│ 3   ┆ null ┆ 13.0        ┆ 8.0         │
│ 4   ┆ 20   ┆ 20.0        ┆ 20.0        │
│ 5   ┆ 22   ┆ 22.0        ┆ 22.0        │
└─────┴──────┴─────────────┴─────────────┘

When you interpolate it takes the known values on either side of the missing value to deduce what the missing value should have been. Extrapolating only takes old information to apply it to the future. In this example the trend changes so there's a big difference between extrapolating and interpolating so just be aware that these aren't always interchangeable.

Sign up to request clarification or add additional context in comments.

1 Comment

Thank you. 13 is the answer I'm looking in your example. Linear interpolation can solve for any point along a line once 2x and 2y values are know. A line doesn't stop being linear when it passes outside of a boundary and the Polars docs specifically refer to "linear interpolation" method for interpolate_by. It isn't cubic and only two non-nulls can only be used to define the "line" but they don't need to bracket the null mathematically. I am applying this over very large datasets so Polars would have sped it up but I'm just going back to my original non-Polars solution unfortunately.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.