I'm currently trying to migrate some code to polars but noticed some performance differences in the process.

import os, platform, timeit, numpy as np, pandas as pd, polars as pl

data = np.random.rand(100000, 1)
df_pandas = pd.DataFrame(data)
df_polars = pl.DataFrame(data)

def timer(expr):
    return round(min(timeit.repeat(expr, repeat=5, number=5)), 8)

print("---- info ----")
print(f"platform={platform.platform()}; processor={platform.processor()}; CPUs={os.cpu_count()}")
print(f"python={platform.python_version()}; numpy={np.__version__}; pandas={pd.__version__}; polars={pl.__version__}")

print("---- pow(2) ----")
print("pandas:", timer(lambda: df_pandas.pow(2)))
print("polars:", timer(lambda: df_polars.select(pl.all().pow(2))))

print("---- sum ----")
print("pandas:", timer(lambda: df_pandas.sum()))
print("polars:", timer(lambda: df_polars.sum()))

The output of this snippet is

---- info ----
platform=macOS-11.6.5-x86_64-i386-64bit; processor=i386; CPUs=4
python=3.8.13; numpy=1.22.4; pandas=1.4.2; polars=0.13.47
---- pow(2) ----
pandas: 0.00147684
polars: 0.01482804
---- sum ----
pandas: 0.00300668
polars: 0.00027682

These results suggest that polars is much slower than pandas for operations that go through a select expression, but faster for ones that are called directly on the dataframe.

In reality, my dataframe is much bigger (rows > 1,000,000, cols > 100,000), where the performance difference is much more significant.

Any suggestions as to what might be going on, and whether there is a way to achieve the same (or better) performance in polars?

  • What datatype is your real data? (integer or float?) Is it float32 or 64? Since it recently came up, floating-point summation can have different accuracy depending on implementation - that's also a factor one needs to take into account. And pandas varies how they do it depending on which additional modules are installed. Commented Jun 18, 2022 at 19:58

1 Answer

In polars >= 0.13.49 the power operation is optimized for certain exponents, such as squaring. When I run the snippet on that version, polars is faster than pandas for both operations:

---- info ----
platform=Linux-5.13.0-51-generic-x86_64-with-glibc2.31; processor=x86_64; CPUs=12
python=3.9.12; numpy=1.22.4; pandas=1.4.2; polars=0.13.49
---- pow(2) ----
pandas: 0.00041451
polars: 0.0003346
---- sum ----
pandas: 0.00157432
polars: 0.00011628


3 Comments

I have polars=0.14.2 and ran the same test (in a Kaggle notebook), and polars ran much slower than pandas. Why could that be? --- info ---- platform=Linux-5.10.133+-x86_64-with-debian-bullseye-sid; processor=x86_64; CPUs=4 python=3.7.12; numpy=1.21.6; pandas=1.3.5; polars=0.14.2 ---- pow(2) ---- pandas: 0.00086124 polars: 3.85648806 ---- sum ---- pandas: 0.00369703 polars: 0.27322495
The DataFrame construction from numpy has changed, which I think is a bug. In any case, look at the shapes: you are not comparing the same thing.
My run on polars==0.14.4: ---- info ---- platform=Linux-5.15.0-46-generic-x86_64-with-glibc2.31; processor=x86_64; CPUs=12 python=3.9.12; numpy=1.23.0; pandas=1.4.2; polars=0.14.4 ---- pow(2) ---- pandas: 0.00044163 polars: 0.00042512 ---- sum ---- pandas: 0.00169855 polars: 0.0001283 So the issues are resolved.
