Polars interpolate_by fails when null values are at the beginning or end of a column

Question

I've noticed some unexpected behavior with the interpolate_by expression and I'm not sure what is going on.

import polars as pl

df = pl.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [4, 5, None, 7, 8]
})
df = df.with_columns(interpolate = pl.col('b').interpolate_by('a'))
print(df)

results in this:

┌─────┬──────┬─────────────┐
│ a   ┆ b    ┆ interpolate │
│ --- ┆ ---  ┆ ---         │
│ i64 ┆ i64  ┆ f64         │
╞═════╪══════╪═════════════╡
│ 1   ┆ 4    ┆ 4.0         │
│ 2   ┆ 5    ┆ 5.0         │
│ 3   ┆ null ┆ 6.0         │
│ 4   ┆ 7    ┆ 7.0         │
│ 5   ┆ 8    ┆ 8.0         │
└─────┴──────┴─────────────┘

which is correct. However this:

df = pl.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [4, 5, 6, 7, None]
})
df = df.with_columns(interpolate = pl.col('b').interpolate_by('a'))
print(df)

results in this:

shape: (5, 3)
┌─────┬──────┬─────────────┐
│ a   ┆ b    ┆ interpolate │
│ --- ┆ ---  ┆ ---         │
│ i64 ┆ i64  ┆ f64         │
╞═════╪══════╪═════════════╡
│ 1   ┆ 4    ┆ 4.0         │
│ 2   ┆ 5    ┆ 5.0         │
│ 3   ┆ 6    ┆ 6.0         │
│ 4   ┆ 7    ┆ 7.0         │
│ 5   ┆ null ┆ null        │
└─────┴──────┴─────────────┘

which is not correct. There is still plenty of data to perform a linear interpolation on column B using the data in column A. Am I missing something here and don't understand how this is supposed to work or is this a bug?

if you think this command should work in different way them maybe send this problem to authors of polars as issue. Maybe it is bug. — furas
– furas, Commented Feb 23 at 20:30
It can only interpolate missing values in between known values, not at the beginning or end. — Dean MacGregor
– Dean MacGregor, Commented Feb 24 at 1:06

Dean MacGregor · Accepted Answer · 2025-02-24 18:46:35Z

Just a quick pedantic point, polars isn't failing to interpolate. You're wanting it to extrapolate.

If you want it to take the last trend and apply it then there's not a built in way to do that but you can do it yourself with expressions with forward_fill and shift.

df.with_columns(
    pl.col("b").fill_null(
        ((pl.col("b")-pl.col("b").shift())/(pl.col("a")-pl.col("a").shift()))
        .forward_fill() *
        (pl.col("a")-pl.col("a").shift()) + pl.col("b").shift()
        )
    )

This extrapolation is interchangeable with your first example's interpolate since there's a constant trend but it'd give you a much different answer if the trend was differing over time (see note at the end).

You can turn this into a function for reuse with

def extrapolate_by(col_to_fill:str|pl.Expr, index_col:str|pl.Expr) -> pl.Expr:
    if isinstance(col_to_fill, str):
        col_to_fill=pl.col(col_to_fill)
    if isinstance(index_col, str):
        index_col=pl.col(index_col)
    return col_to_fill.fill_null(
        ((col_to_fill-col_to_fill.shift())/(index_col-index_col.shift()))
        .forward_fill() *
        ((index_col-index_col.shift())) + col_to_fill.shift()
    )

and then you can simple do df.with_columns(extrapolate_by("b","a")) or if you don't mind not having type hints, you can monkey patch it to pl.Expr and its usage will be chainable

pl.Expr.extrapolate_by = extrapolate_by
df.with_columns(pl.col("b").extrapolate_by("a"))

As is, this only works forward, if you want to fill missing values at the beginning, you'd have to inverse the shift direction and fill direction.

Note on difference between extrapolate and interpolate.

df = pl.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [4, 6, None, 20, 22]
})

df.with_columns(
    interpolate=pl.col("b").interpolate_by("a"),
    extrapolate=pl.col("b").extrapolate_by("a")
    )

shape: (5, 4)
┌─────┬──────┬─────────────┬─────────────┐
│ a   ┆ b    ┆ interpolate ┆ extrapolate │
│ --- ┆ ---  ┆ ---         ┆ ---         │
│ i64 ┆ i64  ┆ f64         ┆ f64         │
╞═════╪══════╪═════════════╪═════════════╡
│ 1   ┆ 4    ┆ 4.0         ┆ 4.0         │
│ 2   ┆ 6    ┆ 6.0         ┆ 6.0         │
│ 3   ┆ null ┆ 13.0        ┆ 8.0         │
│ 4   ┆ 20   ┆ 20.0        ┆ 20.0        │
│ 5   ┆ 22   ┆ 22.0        ┆ 22.0        │
└─────┴──────┴─────────────┴─────────────┘

When you interpolate it takes the known values on either side of the missing value to deduce what the missing value should have been. Extrapolating only takes old information to apply it to the future. In this example the trend changes so there's a big difference between extrapolating and interpolating so just be aware that these aren't always interchangeable.

Thank you. 13 is the answer I'm looking in your example. Linear interpolation can solve for any point along a line once 2x and 2y values are know. A line doesn't stop being linear when it passes outside of a boundary and the Polars docs specifically refer to "linear interpolation" method for interpolate_by. It isn't cubic and only two non-nulls can only be used to define the "line" but they don't need to bracket the null mathematically. I am applying this over very large datasets so Polars would have sped it up but I'm just going back to my original non-Polars solution unfortunately.

Collectives™ on Stack Overflow

Polars interpolate_by fails when null values are at the beginning or end of a column

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related