
Both for polars and numpy, correlation functions seem to break down given very large shifts in location.

I presume this is a precision issue: e.g. a bazillion + 1 is treated as equal to a bazillion + 2. So my question is how to best handle this. A first idea is to de-mean, which will naturally slow down the code, but at least I should avoid the seemingly random behaviour. What would be the standard approach?

Reproducible example:

import polars as pl

df = pl.DataFrame({
    "a": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "b": [4.0, 3.0, 0.0, 1.0, 2.0, 0.0],
})
(df+1123000000000000000000.0).corr()

# Outputs
#shape: (2, 2)
#┌─────┬─────┐
#│ a   ┆ b   │
#│ --- ┆ --- │
#│ f64 ┆ f64 │
#╞═════╪═════╡
#│ 1.0 ┆ 1.0 │
#│ 1.0 ┆ 1.0 │
#└─────┴─────┘
(df+112300000000000000000.0).corr()

# Outputs
#shape: (2, 2)
#┌─────┬─────┐
#│ a   ┆ b   │
#│ --- ┆ --- │
#│ f64 ┆ f64 │
#╞═════╪═════╡
#│ NaN ┆ NaN │
#│ NaN ┆ NaN │
#└─────┴─────┘

(df+11230000000000000.0).corr()

# Still wrong output
#shape: (2, 2)
#┌───────────┬───────────┐
#│ a         ┆ b         │
#│ ---       ┆ ---       │
#│ f64       ┆ f64       │
#╞═══════════╪═══════════╡
#│ 1.0       ┆ -0.424264 │
#│ -0.424264 ┆ 1.0       │
#└───────────┴───────────┘

(df+1123000000000.0).corr()
# Correct output
# shape: (2, 2)
#┌───────────┬───────────┐
#│ a         ┆ b         │
#│ ---       ┆ ---       │
#│ f64       ┆ f64       │
#╞═══════════╪═══════════╡
#│ 1.0       ┆ -0.684653 │
#│ -0.684653 ┆ 1.0       │
#└───────────┴───────────┘


1 Answer

With sufficiently large floating point numbers, I wouldn't even call it "viewed as equal": it becomes literally the same number, because floats can no longer represent the difference between them. There is no way to recover the original number at that point.

For example, df + 1e20 - 1e20 will give you exactly 0.0 for every single row.
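This is easy to check directly. At a magnitude of 1e20 the gap between adjacent float64 values is 16384, so adding 1.0 is a no-op (a small illustration using np.spacing to show the gap):

```python
import numpy as np

x = 1e20
print(x + 1.0 == x)   # True: 1.0 is far below the float64 resolution here
print(np.spacing(x))  # 16384.0, the gap to the next representable float64
```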

The same happens with your "Still wrong output" (1.123e16) example:

>>> df + 1.123e16 - 1.123e16
shape: (6, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════╪═════╡
│ 0.0 ┆ 4.0 │
│ 2.0 ┆ 4.0 │
│ 4.0 ┆ 0.0 │
│ 0.0 ┆ 0.0 │
│ 2.0 ┆ 2.0 │
│ 4.0 ┆ 0.0 │
└─────┴─────┘

The only way to preserve that difference is to not use floats in the first place, but keep in mind that may significantly impact your performance. That said, the corr method relies on numpy, and numpy does not support Decimal, so you'll have to first do the shift in a lossless datatype and only then cast to float:

# apply the large shift losslessly in Decimal
val = pl.lit(1e21, dtype=pl.Decimal)
mean_expr = pl.all().mean().cast(pl.Decimal)
df = df.select(pl.all().cast(pl.Decimal) + val)
# de-mean while still in Decimal, then cast back to float for corr()
df.select(pl.all() - mean_expr).cast(float).corr()
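The de-meaning step is safe because Pearson correlation is shift-invariant: subtracting any constant (the mean included) leaves the result unchanged, as long as the arithmetic stays exact. A quick numpy check of that property, using a shift small enough to be lossless:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])
b = np.array([4.0, 3.0, 0.0, 1.0, 2.0, 0.0])

r0 = np.corrcoef(a, b)[0, 1]
# a shift of 1123.0 is exactly representable next to these values,
# so the correlation is unchanged
r1 = np.corrcoef(a + 1123.0, b + 1123.0)[0, 1]
print(round(r0, 6), round(r1, 6))  # -0.684653 -0.684653
```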

See also print(f'https://{.1+.2}.com'), i.e. https://0.30000000000000004.com.


3 Comments

Just a nit, polars doesn't use numpy to do the calc. github.com/pola-rs/polars/blob/…
@DeanMacGregor using the df.corr() method it does github.com/pola-rs/polars/blob/…
Oops, my mistake.
