Performance improvements for python-polars `df.select` operations

Question

I'm currently trying to migrate some code to polars but noticed some performance differences in the process.

import os, platform, timeit, numpy as np, pandas as pd, polars as pl

data = np.random.rand(100000, 1)
df_pandas = pd.DataFrame(data)
df_polars = pl.DataFrame(data)

def timer(expr):
    return round(min(timeit.repeat(expr, repeat=5, number=5)), 8)

print("---- info ----")
print(f"platform={platform.platform()}; processor={platform.processor()}; CPUs={os.cpu_count()}")
print(f"python={platform.python_version()}; numpy={np.__version__}; pandas={pd.__version__}; polars={pl.__version__}")

print("---- pow(2) ----")
print("pandas:", timer(lambda: df_pandas.pow(2)))
print("polars:", timer(lambda: df_polars.select(pl.all().pow(2))))

print("---- sum ----")
print("pandas:", timer(lambda: df_pandas.sum()))
print("polars:", timer(lambda: df_polars.sum()))

The output of this snippet is

---- info ----
platform=macOS-11.6.5-x86_64-i386-64bit; processor=i386; CPUs=4
python=3.8.13; numpy=1.22.4; pandas=1.4.2; polars=0.13.47
---- pow(2) ----
pandas: 0.00147684
polars: 0.01482804
---- sum ----
pandas: 0.00300668
polars: 0.00027682

These results suggest that polars is much slower than pandas for operations that go through `select`, but faster for operations performed directly on the dataframe.

My real dataframe is much bigger (rows > 1,000,000, cols > 100,000), and there the performance difference is far more significant.

Any suggestions as to what might be going on, and is there a faster way to get the same (or better) performance in polars?

What datatype is your real data: integer or float? If float, is it float32 or float64? As came up recently, floating-point summation can differ in accuracy depending on the implementation, so that is also a factor to take into account. Pandas also varies how it sums depending on which additional modules are installed.
– ramslök, Commented Jun 18, 2022 at 19:58
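To illustrate the point in the comment above about summation accuracy depending on the implementation, here is a minimal sketch using only the Python standard library. It does not reproduce what pandas or polars do internally; it only shows that two summation strategies can disagree on the same float64 inputs.

```python
import math

values = [0.1] * 10

# Naive left-to-right accumulation: rounding error builds up at each step.
naive_total = 0.0
for v in values:
    naive_total += v

# math.fsum tracks the lost low-order bits and returns a correctly rounded sum.
exact_total = math.fsum(values)

print(naive_total, exact_total, naive_total == exact_total)
```

When two libraries report slightly different sums, the difference may come from the summation strategy rather than from a bug.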

ritchie46 · Accepted Answer · 2022-06-20 12:01:20Z


In polars >= 0.13.49 the power operation is special-cased for certain exponents, such as squaring. When I run the snippet on that version, both operations are faster in polars than in pandas.

---- info ----
platform=Linux-5.13.0-51-generic-x86_64-with-glibc2.31; processor=x86_64; CPUs=12
python=3.9.12; numpy=1.22.4; pandas=1.4.2; polars=0.13.49
---- pow(2) ----
pandas: 0.00041451
polars: 0.0003346
---- sum ----
pandas: 0.00157432
polars: 0.00011628

edited Jun 20, 2022 at 12:01

answered Jun 18, 2022 at 13:44


3 Comments

Chris Over a year ago

I have polars=0.14.2 and ran the same test (in a Kaggle notebook), and polars runs much slower than pandas. Why could that be?

---- info ----
platform=Linux-5.10.133+-x86_64-with-debian-bullseye-sid; processor=x86_64; CPUs=4
python=3.7.12; numpy=1.21.6; pandas=1.3.5; polars=0.14.2
---- pow(2) ----
pandas: 0.00086124
polars: 3.85648806
---- sum ----
pandas: 0.00369703
polars: 0.27322495

ritchie46 Over a year ago

The dataframe construction from numpy has changed, which I think is a bug. In any case, look at the shapes: you are not comparing the same thing.

ritchie46 Over a year ago

My run on polars==0.14.4:

---- info ----
platform=Linux-5.15.0-46-generic-x86_64-with-glibc2.31; processor=x86_64; CPUs=12
python=3.9.12; numpy=1.23.0; pandas=1.4.2; polars=0.14.4
---- pow(2) ----
pandas: 0.00044163
polars: 0.00042512
---- sum ----
pandas: 0.00169855
polars: 0.0001283

So the issues are resolved.
