
Both for polars and numpy, correlation functions seem to break down given very large shifts in location.

I presume this is a precision issue: e.g. a bazillion + 1 is treated as equal to a bazillion + 2. So my question is how to best handle this. A first idea is to de-mean, which will naturally slow down the code, but at least I should avoid the seemingly random behaviour. What would be the standard approach?

Reproducible example:

import polars as pl

df = pl.DataFrame({
    "a": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "b": [4.0, 3.0, 0.0, 1.0, 2.0, 0.0],
})
(df+1123000000000000000000.0).corr()

# Outputs
#shape: (2, 2)
#┌─────┬─────┐
#│ a   ┆ b   │
#│ --- ┆ --- │
#│ f64 ┆ f64 │
#╞═════╪═════╡
#│ 1.0 ┆ 1.0 │
#│ 1.0 ┆ 1.0 │
#└─────┴─────┘
(df+112300000000000000000.0).corr()

# Outputs
#shape: (2, 2)
#┌─────┬─────┐
#│ a   ┆ b   │
#│ --- ┆ --- │
#│ f64 ┆ f64 │
#╞═════╪═════╡
#│ NaN ┆ NaN │
#│ NaN ┆ NaN │
#└─────┴─────┘

(df+11230000000000000.0).corr()

# Still wrong output
#shape: (2, 2)
#┌───────────┬───────────┐
#│ a         ┆ b         │
#│ ---       ┆ ---       │
#│ f64       ┆ f64       │
#╞═══════════╪═══════════╡
#│ 1.0       ┆ -0.424264 │
#│ -0.424264 ┆ 1.0       │
#└───────────┴───────────┘

(df+1123000000000.0).corr()
# Correct output
# shape: (2, 2)
#┌───────────┬───────────┐
#│ a         ┆ b         │
#│ ---       ┆ ---       │
#│ f64       ┆ f64       │
#╞═══════════╪═══════════╡
#│ 1.0       ┆ -0.684653 │
#│ -0.684653 ┆ 1.0       │
#└───────────┴───────────┘


1 Answer

With sufficiently large floating point numbers, I wouldn't even call it "viewed as equal": it becomes literally the same number, because floats can no longer represent the difference between them. There is no way to recover the original number at that point.

For example, df + 1e20 - 1e20 will give you exactly 0.0 for every single row.
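This is easy to check directly. At a magnitude of 1e20 the gap between adjacent float64 values is 16384, so adding 1.0 is a no-op (a small illustration using np.spacing to show the gap):

```python
import numpy as np

x = 1e20
print(x + 1.0 == x)   # True: 1.0 is far below the float64 resolution here
print(np.spacing(x))  # 16384.0, the gap to the next representable float64
```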

The same happens with your "Still wrong output" (1.123e16) example:

>>> df + 1.123e16 - 1.123e16
shape: (6, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════╪═════╡
│ 0.0 ┆ 4.0 │
│ 2.0 ┆ 4.0 │
│ 4.0 ┆ 0.0 │
│ 0.0 ┆ 0.0 │
│ 2.0 ┆ 2.0 │
│ 4.0 ┆ 0.0 │
└─────┴─────┘

The only way to preserve that difference is to not use floats in the first place, but keep in mind that may significantly impact your performance. That said, the corr method relies on numpy, and numpy does not support Decimal, so you'll have to first do the shift in a lossless datatype and only then cast to float:

# apply the large shift losslessly in Decimal
val = pl.lit(1e21, dtype=pl.Decimal)
mean_expr = pl.all().mean().cast(pl.Decimal)
df = df.select(pl.all().cast(pl.Decimal) + val)
# de-mean while still in Decimal, then cast back to float for corr()
df.select(pl.all() - mean_expr).cast(float).corr()
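The de-meaning step is safe because Pearson correlation is shift-invariant: subtracting any constant (the mean included) leaves the result unchanged, as long as the arithmetic stays exact. A quick numpy check of that property, using a shift small enough to be lossless:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])
b = np.array([4.0, 3.0, 0.0, 1.0, 2.0, 0.0])

r0 = np.corrcoef(a, b)[0, 1]
# a shift of 1123.0 is exactly representable next to these values,
# so the correlation is unchanged
r1 = np.corrcoef(a + 1123.0, b + 1123.0)[0, 1]
print(round(r0, 6), round(r1, 6))  # -0.684653 -0.684653
```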

See also print(f'https://{.1+.2}.com'), i.e. https://0.30000000000000004.com.


3 Comments

Just a nit, polars doesn't use numpy to do the calc. github.com/pola-rs/polars/blob/…
@DeanMacGregor using the df.corr() method it does github.com/pola-rs/polars/blob/…
Oops, my mistake.
