2 votes
0 answers
189 views

I am writing Parquet files using two different frameworks—Apache Spark (Scala) and Polars (Python)—with the same schema and data. However, when I query the resulting Parquet files using Apache ...
user29976558
2 votes
1 answer
205 views

I'm slowly migrating to polars from pandas and I have found that in some cases the polars syntax is tricky. I'm seeking help to do a group_by followed by a describe using less (or more readable) code. ...
jcaliz • 4,073
5 votes
2 answers
227 views

The code below shows a solution I have found in order to expand a dataframe to include the cartesian product of columns A and B, filling in the other columns with null values. I'm wondering if there ...
rindis • 1,159
0 votes
3 answers
256 views

I have a dataset, part of which looks like this: customer product price quantity sale_time C060235 P0204 6.99 2 2024-03-11 08:24:11 C045298 P0167 14.99 1 2024-03-11 08:35:06 ... C039877 P0024 126.95 1 ...
Scott Deerwester
0 votes
0 answers
216 views

I read data from the same parquet files multiple times using polars (polars rust engine and pyarrow) and using pandas pyarrow backend (not fastparquet as it was very slow), see below code. All the ...
newandlost • 1,080
0 votes
1 answer
82 views

I am trying to use polars~=1.24.0 on Python 3.13 to process larger-than-memory sized datasets. Specifically, I am loading many (i.e., 35 of them) parquet files via the polars.scan_parquet('base-name-*....
Arda Aytekin • 1,303
1 vote
1 answer
105 views

Here is the data import polars as pl from datetime import datetime df = pl.DataFrame( { "time": [ datetime(2021, 2, 1), datetime(2021, 4, 2), ...
JohnRos • 1,257
1 vote
1 answer
588 views

I have a CSV (or rather TSV) I got from stripping the header off a gVCF with bcftools view foo.g.vcf -H > foo.g.vcf.csv A head gives me this, so everything looks as expected so far chr1H 1 ...
skranz • 65
0 votes
1 answer
81 views

I have a table "Data" containing arrays of FOOs, and a separate table "Lookup" where I can find the BAR for each FOO. I want to write a SQL query which returns the Data table, but ...
DarthVlader
2 votes
2 answers
155 views

I am working with Polars and need to ensure that my dataset contains all possible combinations of unique values in certain index columns. If a combination is missing in the original data, it should be ...
Olibarer • 423
2 votes
1 answer
100 views

I have some coordinate data; some of it high precision, some of it low precision thanks to multiple data sources and other operational realities. I want to have a column that indicates the relative ...
Kyle • 1,012
1 vote
1 answer
63 views

Both for polars and numpy, correlation functions seem to break down given very large changes to the location. I presume that has to do with precision issues, as e.g. a bazillion +1 is viewed as equal ...
Dontwannausemynormalnick
1 vote
2 answers
379 views

I have a dataframe that contains a product name, question, and answers. I would like to process the dataframe and transform it into a JSON format. Each product should have nested sections for ...
Simon • 1,209
3 votes
2 answers
98 views

I have a dataframe df. >>> import polars as pl >>> >>> >>> df = pl.DataFrame({"col": ["row1", "row2", "row3"]}) >>&...
user459872 • 25.9k
2 votes
1 answer
221 views

I want to vertically merge two polars.LazyFrames in order to avoid collecting both LazyFrames beforehand, which is computationally expensive. I have tried extend(), concat(), and vstack() but none of ...
realbitsurfer
1 vote
2 answers
201 views

I have a weight vector: weight_vec = pl.Series("weights", [0.125, 0.0625, 0.03125]) And also a DataFrame containing up to m variables. For simplicity, we will only have two variables: df = ...
Kevin Li • 649
2 votes
1 answer
202 views

I am in a situation where I have some time series data, potentially looking like this: { "t": [1, 2, 5, 6, 7], "y": [1, 1, 1, 1, 1], } As you can see, the time stamp jumps ...
Thomas • 1,351
1 vote
1 answer
147 views

I'm reading data from Google BigQuery into a polars dataframe. Using a string query succeeds. I'd prefer to use an alchemy statement. Using python-bigquery-sqlalchemy provided by Google and following ...
eldrly • 326
2 votes
2 answers
291 views

I have two large datasets stored in partitioned Parquet format on S3, partitioned by category_id. I need to join them on category_id and label_id using Polars and write the results to Postgres. The ...
Joost Döbken
2 votes
0 answers
221 views

This is for a POC to see if polars can do some things faster/better/cheaper than a current SQL solution. The first test case involves a count(*) over an eight-table join. The eight tables are ...
sicsmpr • 55
1 vote
1 answer
72 views

I'm using Polars to process a DataFrame so I can save it as JSON. I know I can use the method .write_json(), however, I would like to add a new level to the JSON. My current approach: import polars as ...
Simon • 1,209
5 votes
1 answer
354 views

I have a table representing a schedule, i.e. it contains day (monday-sunday), start_time and end_time fields df = pl.DataFrame({ "day": ["monday", "tuesday", "...
David Waterworth
0 votes
2 answers
113 views

I am trying to run a custom function on a lazy dataframe on a row-by-row basis. Function itself does not matter, so I'm using softmax as a stand-in. All that matters about it is that it is not ...
velochy • 443
1 vote
0 answers
144 views

UPDATE: See this SO post where the streaming engine is used: How do I ensure that a Polars expression plugin properly uses multiple CPUs? Original post: I want to write a custom Polars Expression ...
thoooooooomas
0 votes
2 answers
90 views

I am trying to import a list of lists into Polars and get the data in separate columns. Example. numbers = [['304-144635', 0], ['123-091523', 7], ['305-144931', 12], ['623-101523', 16], ['305-145001', ...
diogenes • 2,181
1 vote
2 answers
181 views

I have a Dataset containing GPS Coordinates of a few planes. I would like to calculate the bearing of each plane at every point in time. The Dataset has, among others, these columns: event_uid plane_no ...
jimfawkes • 385
-2 votes
2 answers
459 views

I'm trying to convert a Polars dataframe to a JSON object, but I can't seem to find a way to change the format of it between row/col orientation. In Pandas, by default, it creates a column-oriented ...
Ghost • 1,594
1 vote
1 answer
93 views

I need to backfill a column over one of three possible columns, based on which one matches the non-null cell in the column to be backfilled. My dataframe looks something like this: import polars as pl ...
epistemetrica
1 vote
1 answer
119 views

I am using polars to hash some columns in a data set. One column contains lists of strings and the other column strings. My approach is to cast each column as type string and then hash the columns....
MikeB2019x • 1,297
4 votes
1 answer
112 views

A similar question is asked here; however, it didn't seem to work in my case. I have a dataframe with 3 columns: date, groups, prob. What I want is to create a 3 day rolling mean of the prob column values ...
AColoredReptile
2 votes
2 answers
238 views

I am curious whether I am missing something in the Polars Expression library in how this could be done more efficiently. I have a dataframe of protein sequences, where I would like to create k-long ...
Olga Botvinnik
2 votes
1 answer
79 views

I'd like to use a function like cumsum, but that would create a set of all values contained in the column up to the point, and not to sum them df = pl.DataFrame({"a": [1, 2, 3, 4]}) df["...
ClementWalter
2 votes
1 answer
306 views

I am working in Polars and I have data set where one column is lists of strings. To see what it's like: import pandas as pd list_of_lists = [['base', 'base.current base', 'base.current base....
MikeB2019x • 1,297
3 votes
1 answer
383 views

So I got two csv which I load as polars frames: left: left_csv = b""" track_name,type,yield,group 8CEB45v1,corn,0.146957,A A188v2,corn,0.86308,A B73v6,corn,0.326076,A CI6621v1,sweetcorn,...
Pm740 • 423
3 votes
3 answers
526 views

I am trying to get the shrunk data type of a column using an expression, to be able to run validations against it. import polars as pl df = pl.DataFrame({"list_column": [[1, 2], [3, 4], [...
yz_jc • 271
1 vote
1 answer
214 views

I've noticed some unexpected behavior with the interpolate_by expression and I'm not sure what is going on. import polars as pl df = pl.DataFrame({ 'a': [1, 2, 3, 4, 5], 'b': [4, 5, None, 7, ...
nybhh • 101
0 votes
1 answer
137 views

The following pandas code removes all the .0 decimal precision if I have a float column with 1.0, 2.0, 3.0 values: import pandas as pd df = pd.DataFrame({ "date": ["2025-01-01"...
Nyssance • 401
4 votes
1 answer
101 views

I expected either a or b would be 0.0 (not NaN) and c would always be 0.0. The Polars documentation said to use | as "or" and & as "and". I believe I have the logic right: (((...
Steve Maguire
4 votes
1 answer
499 views

I've written a custom function in Polars to generate a horizontal forward/backward fill list of expressions. The function accepts an iterable of expressions (or column names) to determine the order of ...
Olibarer • 423
2 votes
2 answers
94 views

Here, column "AB" is just being created and at the same time is being used as input to create column "ABC". This fails. df = df.with_columns( (pl.col("A")+pl.col("...
Nip • 474
2 votes
1 answer
91 views

I have a data frame with 6 value columns and I want to sum the largest 3 of them. I also want to create an ID matrix to identify which columns were included in the sum. So the initial data frame may ...
marinerbeck
1 vote
1 answer
154 views

I need to convert each element in a polars df into the following structure: { "value": "A", "lineItemName": "value", "dimensions": [ ...
Vinz • 487
4 votes
3 answers
112 views

The breakpoints data is the following: breakpoints = pl.DataFrame( { "features": ["feature_0", "feature_0", "feature_1"], "breakpoints&...
Kevin Li • 649
2 votes
1 answer
75 views

How to extend this df = df.select( pl.col("x1").map_batches(custom_function).alias("new_x1") ) to something like df = df.select( pl.col("x1","x2")....
Nip • 474
2 votes
1 answer
335 views

I am dynamically generating Airflow DAGs based on data from a Polars DataFrame. The DAG definition includes filtering this DataFrame at DAG creation time and again inside a task when the DAG runs. ...
elvainch • 1,407
0 votes
1 answer
194 views

I'm working on an asynchronous FastAPI project that fetches large datasets from an API. Currently, I process the JSON response using a list comprehension and NumPy to extract device IDs and names. For ...
Foxbat • 364
2 votes
1 answer
61 views

I observed that the polars expression: pl.DataFrame(data={}).select(a=pl.lit(None) | pl.lit(True)) evaluates to True, but it should evaluate to None in my estimation, based on the concept of "...
Silverdust • 1,527
3 votes
3 answers
725 views

Sorry if the title is confusing. I'm pretty familiar with Pandas and think I have a solid idea of how I would do this there. Pretty much just brute-force iteration and index-based assignment for the ...
Sparky Parky
2 votes
2 answers
179 views

I encountered some confusing behavior with polars type-casting (silently truncating floats to ints without raising an error, even when explicitly specifying strict=True), so I headed over to the ...
Max Power • 9,146
0 votes
1 answer
310 views

I need to pass a variable number of columns to a user-defined function. The docs mention to first create a pl.struct and subsequently let the function extract it. Here's the example given on the ...
Andi • 5,177
