2,817 questions
2 votes · 0 answers · 189 views
Why does a Parquet file written with Polars query faster than one written with Spark?
I am writing Parquet files using two different frameworks—Apache Spark (Scala) and Polars (Python)—with the same schema and data. However, when I query the resulting Parquet files using Apache ...
2 votes · 1 answer · 205 views
Polars group_by + describe: return all columns as single dataframe
I'm slowly migrating to polars from pandas and I have found that in some cases the polars syntax is tricky.
I'm seeking help to do a group_by followed by a describe using less (or more readable) code.
...
5 votes · 2 answers · 227 views
Expanding polars dataframe with cartesian product of two columns [duplicate]
The code below shows a solution I have found in order to expand a dataframe to include the cartesian product of columns A and B, filling in the other columns with null values. I'm wondering if there ...
0 votes · 3 answers · 256 views
Sum of products of columns in polars
I have a dataset, part of which looks like this:
customer   product  price    quantity  sale_time
C060235    P0204    6.99     2         2024-03-11 08:24:11
C045298    P0167    14.99    1         2024-03-11 08:35:06
...
C039877    P0024    126.95   1         ...
0 votes · 0 answers · 216 views
Values differ on multiple reads from parquet files using polars read_parquet but not with pandas read_parquet by workstation
I read data from the same parquet files multiple times using polars (polars rust engine and pyarrow) and using the pandas pyarrow backend (not fastparquet, as it was very slow); see the code below.
All the ...
0 votes · 1 answer · 82 views
Efficient (and Safe) Way of Accessing Larger-than-Memory Datasets in Parallel
I am trying to use polars~=1.24.0 on Python 3.13 to process larger-than-memory sized datasets. Specifically, I am loading many (i.e., 35 of them) parquet files via the polars.scan_parquet('base-name-*....
1 vote · 1 answer · 105 views
Polars upsampling with grouping does not behave as expected
Here is the data
import polars as pl
from datetime import datetime
df = pl.DataFrame(
    {
        "time": [
            datetime(2021, 2, 1),
            datetime(2021, 4, 2),
            ...
1 vote · 1 answer · 588 views
Why can I read a file (gVCF) with pandas but not with polars?
I have a CSV (or rather TSV) I got from stripping the header off a gVCF with
bcftools view foo.g.vcf -H > foo.g.vcf.csv
A head gives me this, so everything looks as expected so far
chr1H 1 ...
0 votes · 1 answer · 81 views
Convert the values in a SQL array by SELECTing from another table?
I have a table "Data" containing arrays of FOOs, and a separate table "Lookup" where I can find the BAR for each FOO. I want to write a SQL query which returns the Data table, but ...
2 votes · 2 answers · 155 views
Create a uniform dataset in Polars with cross joins
I am working with Polars and need to ensure that my dataset contains all possible combinations of unique values in certain index columns. If a combination is missing in the original data, it should be ...
2 votes · 1 answer · 100 views
Detect coordinate precision in polars floats?
I have some coordinate data; some of it high precision, some of it low precision thanks to multiple data sources and other operational realities. I want to have a column that indicates the relative ...
1 vote · 1 answer · 63 views
Unexpected behaviour for numpy/polars correlation given large values
Both for Polars and NumPy, correlation functions seem to break down given very large shifts in location.
I presume that has to do with precision issues, as e.g. a bazillion +1 is viewed as equal ...
1 vote · 2 answers · 379 views
Transforming polars Dataframe to Nested JSON Format
I have a dataframe that contains a product name, question, and answers. I would like to process the dataframe and transform it into a JSON format. Each product should have nested sections for ...
3 votes · 2 answers · 98 views
How to include first matching pattern as a column
I have a dataframe df.
>>> import polars as pl
>>> df = pl.DataFrame({"col": ["row1", "row2", "row3"]})
>>> ...
2 votes · 1 answer · 221 views
Is there a way to vertically merge two Polars LazyFrames?
I want to vertically merge two polars.LazyFrames in order to avoid collecting both LazyFrames beforehand, which is computationally expensive. I have tried extend(), concat(), and vstack() but none of ...
1 vote · 2 answers · 201 views
Cumulative Elementwise Sum by Python Polars
I have a weight vector:
weight_vec = pl.Series("weights", [0.125, 0.0625, 0.03125])
And also a DataFrame containing up to m variables. For simplicity, we will only have two variables:
df = ...
2 votes · 1 answer · 202 views
Fill gaps in time series data in a Polars Lazy- / Dataframe
I am in a situation where I have some time series data, potentially looking like this:
{
    "t": [1, 2, 5, 6, 7],
    "y": [1, 1, 1, 1, 1],
}
As you can see, the time stamp jumps ...
1 vote · 1 answer · 147 views
SqlAlchemy Table Object Doesn't Synchronise with BigQuery
I'm reading data from Google BigQuery into a polars dataframe. Using a string query succeeds. I'd prefer to use an alchemy statement. Using python-bigquery-sqlalchemy provided by Google and following ...
2 votes · 2 answers · 291 views
Join large partitioned parquet datasets in Polars and write to Postgres?
I have two large datasets stored in partitioned Parquet format on S3, partitioned by category_id. I need to join them on category_id and label_id using Polars and write the results to Postgres.
The ...
2 votes · 0 answers · 221 views
python polars numerous joins crashing
This is for a POC to see if polars can do some things faster/better/cheaper than a current SQL solution. The first test case involves a count(*) over an eight-table join. The eight tables are ...
1 vote · 1 answer · 72 views
How to add a new level to JSON output using Polars in Python?
I'm using Polars to process a DataFrame so I can save it as JSON. I know I can use the .write_json() method; however, I would like to add a new level to the JSON.
My current approach:
import polars as ...
5 votes · 1 answer · 354 views
Adding hours to a Polars time column
I have a table representing a schedule, i.e. it contains day (monday-sunday), start_time and end_time fields
df = pl.DataFrame({
    "day": ["monday", "tuesday", "...
0 votes · 2 answers · 113 views
Polars lazy dataframe custom function over rows
I am trying to run a custom function on a lazy dataframe on a row-by-row basis.
The function itself does not matter, so I'm using softmax as a stand-in. All that matters about it is that it is not ...
1 vote · 0 answers · 144 views
Using multithreading in Polars Expression Plugins
UPDATE:
See this SO post where the streaming engine is used:
How do I ensure that a Polars expression plugin properly uses multiple CPUs?
Original post:
I want to write a custom Polars Expression ...
0 votes · 2 answers · 90 views
Polars import from list of lists into columns [closed]
I am trying to import a list of lists into Polars and get the data in separate columns.
Example.
numbers = [['304-144635', 0], ['123-091523', 7], ['305-144931', 12], ['623-101523', 16], ['305-145001', ...
1 vote · 2 answers · 181 views
How can I perform a calculation on a rolling window over a partition in polars?
I have a Dataset containing GPS Coordinates of a few planes. I would like to calculate the bearing of each plane at every point in time.
The dataset has, among others, these columns:
event_uid
plane_no
...
-2 votes · 2 answers · 459 views
How can I convert a Polars dataframe to a column-oriented JSON object? [closed]
I'm trying to convert a Polars dataframe to a JSON object, but I can't seem to find a way to change the format of it between row/col orientation. In Pandas, by default, it creates a column-oriented ...
1 vote · 1 answer · 93 views
How to conditionally choose which column to backward fill over in polars?
I need to backfill a column over one of three possible columns, based on which one matches the non-null cell in the column to be backfilled.
My dataframe looks something like this:
import polars as pl
...
1 vote · 1 answer · 119 views
Can polars have a boolean in a 'with_columns' statement?
I am using polars to hash some columns in a data set. One column contains lists of strings and the other contains strings. My approach is to cast each column to type string and then hash the columns....
4 votes · 1 answer · 112 views
Grouped Rolling Mean in Polars
A similar question is asked here.
However, it didn't seem to work in my case.
I have a dataframe with 3 columns, date, groups, prob. What I want is to create a 3 day rolling mean of the prob column values ...
2 votes · 2 answers · 238 views
How to add a group-specific index to a polars dataframe with an expression instead of a map_groups user-defined function?
I am curious whether I am missing something in the Polars Expression library in how this could be done more efficiently. I have a dataframe of protein sequences, where I would like to create k-long ...
2 votes · 1 answer · 79 views
polars cum sum to create a set and not actually sum
I'd like to use a function like cum_sum, but one that would build a set of all values contained in the column up to that point, rather than summing them
df = pl.DataFrame({"a": [1, 2, 3, 4]})
df["...
2 votes · 1 answer · 306 views
polars casting a list to string
I am working in Polars and I have a data set where one column contains lists of strings. To see what it's like:
import pandas as pd
list_of_lists = [['base', 'base.current base', 'base.current base....
3 votes · 1 answer · 383 views
How to full join / merge two frames with polars while updating left with right values?
So I have two CSVs which I load as Polars frames:
left:
left_csv = b"""
track_name,type,yield,group
8CEB45v1,corn,0.146957,A
A188v2,corn,0.86308,A
B73v6,corn,0.326076,A
CI6621v1,sweetcorn,...
3 votes · 3 answers · 526 views
Get column type using a Polars expression
I am trying to get the shrunk data type of a column using an expression, to be able to run validations against it.
import polars as pl
df = pl.DataFrame({"list_column": [[1, 2], [3, 4], [...
1 vote · 1 answer · 214 views
Polars interpolate_by fails when null values are at the beginning or end of a column
I've noticed some unexpected behavior with the interpolate_by expression and I'm not sure what is going on.
import polars as pl
df = pl.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [4, 5, None, 7, ...
0 votes · 1 answer · 137 views
How to convert float columns without decimal into int columns in Polars? [closed]
The following pandas code removes all the .0 decimal precision if I have a float column with 1.0, 2.0, 3.0 values:
import pandas as pd
df = pd.DataFrame({
"date": ["2025-01-01"...
4 votes · 1 answer · 101 views
How do I write a query like (A or B) and C in Polars?
I expected either a or b would be 0.0 (not NaN) and c would always be 0.0. The Polars documentation said to use | as "or" and & as "and". I believe I have the logic right:
(((...
4 votes · 1 answer · 499 views
How can I iterate over all columns using pl.all() in Polars?
I've written a custom function in Polars to generate a horizontal forward/backward fill list of expressions. The function accepts an iterable of expressions (or column names) to determine the order of ...
2 votes · 2 answers · 94 views
Create column from other columns created within same `with_columns` context
Here, column "AB" is just being created and at the same time is being used as input to create column "ABC". This fails.
df = df.with_columns(
    (pl.col("A")+pl.col("...
2 votes · 1 answer · 91 views
How to perform row aggregation across the largest x columns in a polars data frame?
I have a data frame with 6 value columns and I want to sum the largest 3 of them. I also want to create an ID matrix to identify which columns were included in the sum.
So the initial data frame may ...
1 vote · 1 answer · 154 views
How to instantiate a single element Array/List in Polars expressions efficiently?
I need to convert each element in a polars df into the following structure:
{
"value": "A",
"lineItemName": "value",
"dimensions": [
...
4 votes · 3 answers · 112 views
Python Polars Encoding Continous Variables from Breakpoints in another DataFrame
The breakpoints data is the following:
breakpoints = pl.DataFrame(
    {
        "features": ["feature_0", "feature_0", "feature_1"],
        "breakpoints"...
2 votes · 1 answer · 75 views
How to apply a custom function across multiple columns
How to extend this
df = df.select(
    pl.col("x1").map_batches(custom_function).alias("new_x1")
)
to something like
df = df.select(
    pl.col("x1", "x2")....
2 votes · 1 answer · 335 views
Airflow DAG gets stuck when filtering a Polars DataFrame
I am dynamically generating Airflow DAGs based on data from a Polars DataFrame. The DAG definition includes filtering this DataFrame at DAG creation time and again inside a task when the DAG runs.
...
0 votes · 1 answer · 194 views
Handling Occasional 100 MB API Responses in FastAPI: Polars vs. NumPy/Pandas?
I'm working on an asynchronous FastAPI project that fetches large datasets from an API. Currently, I process the JSON response using a list comprehension and NumPy to extract device IDs and names. For ...
2 votes · 1 answer · 61 views
Null-aware Evaluation flawed in Polars 1.22.0?
I observed that the polars expression:
pl.DataFrame(data={}).select(a=pl.lit(None) | pl.lit(True))
evaluates to True, but it should evaluate to None in my estimation,
based on the concept of "...
3 votes · 3 answers · 725 views
Create new Polars columns by mapping values in a (delimited) string column using a dictionary
Sorry if the title is confusing.
I'm pretty familiar with Pandas and think I have a solid idea of how I would do this there. Pretty much just brute-force iteration and index-based assignment for the ...
2 votes · 2 answers · 179 views
confused by silent truncation in polars type casting
I encountered some confusing behavior with polars type-casting (silently truncating floats to ints without raising an error, even when explicitly specifying strict=True), so I headed over to the ...
0 votes · 1 answer · 310 views
Passing a polars struct to a user-defined function using map_batches
I need to pass a variable number of columns to a user-defined function. The docs mention to first create a pl.struct and subsequently let the function extract it. Here's the example given on the ...