I'm currently migrating some code from pandas to polars and noticed some performance differences in the process. Here is a minimal benchmark:
import os, platform, timeit, numpy as np, pandas as pd, polars as pl
data = np.random.rand(100000, 1)
df_pandas = pd.DataFrame(data)
df_polars = pl.DataFrame(data)
def timer(expr):
    # Best total time in seconds over 5 rounds of 5 calls each
    return round(min(timeit.repeat(expr, repeat=5, number=5)), 8)
print("---- info ----")
print(f"platform={platform.platform()}; processor={platform.processor()}; CPUs={os.cpu_count()}")
print(f"python={platform.python_version()}; numpy={np.__version__}; pandas={pd.__version__}; polars={pl.__version__}")
print("---- pow(2) ----")
print("pandas:", timer(lambda: df_pandas.pow(2)))
print("polars:", timer(lambda: df_polars.select(pl.all().pow(2))))
print("---- sum ----")
print("pandas:", timer(lambda: df_pandas.sum()))
print("polars:", timer(lambda: df_polars.sum()))
The output of this snippet is:
---- info ----
platform=macOS-11.6.5-x86_64-i386-64bit; processor=i386; CPUs=4
python=3.8.13; numpy=1.22.4; pandas=1.4.2; polars=0.13.47
---- pow(2) ----
pandas: 0.00147684
polars: 0.01482804
---- sum ----
pandas: 0.00300668
polars: 0.00027682
These results suggest that polars is much slower than pandas (roughly 10x here) for operations that go through a select expression, but faster (again roughly 10x) for operations called directly on the dataframe.
In reality, my dataframe is much bigger (rows > 1,000,000, cols > 100,000), and there the performance difference is much more significant.
Any suggestions for what might be going on? Is there a faster way to do this in polars that matches (or beats) the pandas performance?
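A note on reading the timings: timer reports the best total for 5 consecutive calls, not the time of a single call. A minimal sketch of a per-call variant is below. The name timer_per_call is a hypothetical helper that is not part of the benchmark above, and the sum(range(1000)) workload is only a stand-in to show usage.

```python
import timeit


def timer_per_call(func, repeat=5, number=5):
    """Return the best observed time for a single call of func, in seconds.

    Hypothetical helper for illustration; not part of the original benchmark.
    """
    # timeit.repeat returns one total per round; each total covers `number` calls.
    totals = timeit.repeat(func, repeat=repeat, number=number)
    # Take the fastest round and normalise it to a single call.
    return min(totals) / number


# Stand-in workload, used only to demonstrate the helper.
per_call = timer_per_call(lambda: sum(range(1000)))
print(f"best time per call: {per_call:.8f} s")
```

Dividing by number only rescales the figures, so the pandas/polars ratios stay the same; it just makes the absolute values easier to compare with other benchmarks.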