2,817 questions
1
vote
0
answers
39
views
How to show the streaming parts of a polars query using explain()?
I am trying to explain() a Polars query to see which operations can be executed using the streaming engine. Currently, I am only able to do this using show_graph().
From sources on the web, I see that ...
1
vote
1
answer
59
views
Polars parse multiple datetime format [duplicate]
I have a string column in a polars dataframe with multiple datetime formats, and I am using the following code to convert the datatype of the column from string to datetime.
import polars as pl
df = pl.from_dict({'...
0
votes
0
answers
70
views
polars.LazyFrame.sink_csv does not give CRLF line termination [duplicate]
I have a Python file
import polars as pl
import requests
from pathlib import Path
url = "https://raw.githubusercontent.com/leanhdung1994/files/main/processedStep1_enwiktionary_namespace_0_43....
1
vote
3
answers
162
views
Polars: how to write a column of strings into a txt file without escaping?
I have .ndjson files with millions of rows. Each row has a field html which contains HTML strings. I would like to write all such HTML into a .txt file, with one HTML string per line of the .txt file. I ...
2
votes
1
answer
131
views
Why does a nearest join_asof() return exact matches despite allow_exact_matches=False?
I am looking for the nearest non exact match on the dates column:
import polars as pl
df = pl.from_repr("""
┌─────┬────────────┐
│ uid ┆ dates │
│ --- ┆ --- │
│ i64 ┆ date ...
-2
votes
1
answer
80
views
polars.exceptions.DuplicateError: column with name 'name_ID' has more than one occurrence [closed]
I have a dictionary of polars.DataFrames called data_dict.
All the dataframes in the dict values have an extra index column ''.
I want to drop that column and set a new column named 'name_ID'
...
2
votes
1
answer
76
views
Change color of single line in altair line chart based on other indicator column
Imagine having the following polars dataframe "df" that contains the temperature of a machine that is either "active" or "inactive":
import polars as pl
from datetime ...
1
vote
0
answers
76
views
Is it possible to drop/select columns where col.n_unique > 1 with native polars syntax [duplicate]
I have a table that looks like this
import polars as pl
df = pl.DataFrame(
{
"col1": [1, 2, 3, 4, 5],
"col2": [10, 20, 30, 40, 50],
"col3": [...
Advice
0
votes
7
replies
114
views
High volume URL parsing in Python
I use the polars, urllib and tldextract packages in Python to parse 2 columns of URL strings in zstd-compressed parquet files (averaging 8GB, 40 million rows). The parsed output includes the scheme, ...
12
votes
0
answers
326
views
Not displaying DataFrame's name in Data Wrangler extension of VSCode, displaying "Data grid"
I have been using the Data Wrangler extension in VS Code for a while; it is very useful for analyzing datasets and filtering some columns to see the features. When I opened a dataframe in it, it used to ...
1
vote
1
answer
99
views
Altair stacked bar chart in custom order
I've built a dataset in Polars (Python) and am attempting to plot it as a stacked horizontal bar chart using Polars' built-in Altair plot function. However, trying to specify a custom sort order for the ...
1
vote
1
answer
109
views
Polars print changed values between 2 dataframes
Given two polars dataframes of the same shape, I would like to print the number of values that differ between the two, including values that are missing in one dataframe but not in the other.
I came up ...
2
votes
2
answers
91
views
Seeking more efficient method in Python & Polars to perform monthly comparison within each year
I have a CSV of energy consumption data over time (each month for several years).
I want to determine the percentage (decimal portion) for each month across that year; e.g., August was 12.3% of the ...
1
vote
3
answers
100
views
Show matched rows in polars join
When you join two tables, STATA prints the number of rows merged and unmerged.
For instance, take Example 1 at page 13 of the STATA merge doc:
use https://www.stata-press.com/data/r19/autosize
merge 1:...
3
votes
0
answers
146
views
Why does polars join function performance deteriorate so much from version 1.30.0 to 1.31.0?
I noticed a significant performance deterioration when using the polars dataframe join function after upgrading polars from 1.30.0 to 1.31.0. The code snippet is below:
import polars as pl
import time
...
1
vote
3
answers
159
views
Replace value by condition across entire polars df
I'd like to replace any value greater than some condition with zero for any column except the date column in a df.
The closest I've found is
df.with_columns(
pl.when(pl.any_horizontal(pl.col(pl....
2
votes
1
answer
129
views
Find differing rows between two Polars DataFrames based on ID and multiple columns
I have two Polars DataFrames (df1 and df2) with the same columns.
I want to compare them by ID and Iname, and get the rows where any of the other columns (X, Y, Z) differ between the two.
import ...
0
votes
0
answers
163
views
How to efficiently get the last row of a rolling aggregation group without .last()?
I'm working with a large Polars LazyFrame and computing rolling aggregations grouped by customer (Cusid). I need to find the "front" of the rolling window (last Tts_date) for each group to ...
6
votes
1
answer
106
views
Polars streaming: How to compute a nested window aggregation while avoiding in-memory-maps?
I want to calculate the mean over some group column 'a' but include only one value per second group column 'b'.
Constraints:
I want to preserve all original records in the result.
(if possible) avoid ...
4
votes
3
answers
106
views
Extending polars DataFrame while maintaining variables between calls
I would like to code a logger for polars using the Custom Namespace API.
For instance, starting from:
import logging
import polars as pl
penguins_pl = pl.read_csv("https://raw.githubusercontent....
0
votes
1
answer
73
views
Python tempfile TemporaryDirectory path changes multiple times after initialization
I am using tempfile with Polars for the first time and getting some surprising behavior when running it in a serverless Cloud Function-like environment. Here is my simple test code:
try:
with ...
4
votes
4
answers
177
views
Reference column named "*" in Polars
I have a Polars DataFrame with a column named "*" and would like to reference just that column. When I try to use pl.col("*") it is interpreted as a wildcard for "all columns."...
1
vote
2
answers
84
views
Adding an Object column to a polars DataFrame with broadcasting
If I have a DataFrame, I can create a column with a single value like this:
df = pl.DataFrame([[1, 2, 3]])
df.with_columns(pl.lit("ok").alias("metadata"))
shape: (3, 2)
┌──────────...
1
vote
0
answers
75
views
Polars LazyFrame sink_parquet + PartitionByKey slower to S3 than local disk
I'm wondering why I'm seeing such poor performance when writing a LazyFrame using PartitionByKey to S3 when compared to other methods. Here is a simple test script that writes out some random data to ...
1
vote
2
answers
113
views
python typing distinctions between inline created parameters and variables
Preamble
I'm using polars's write_excel method which has a parameter column_formats which wants a ColumnFormatDict that is defined here and below
ColumnFormatDict: TypeAlias = Mapping[
# dict of ...
2
votes
0
answers
180
views
Speeding up Polars rust plugin branching and aggregating
I'm following the polars plugins tutorial - branch mispredictions and it says that there's a faster way to implement the following code:
#[polars_expr(output_type=Int64)]
fn sum_i64(inputs: &[Series]) -...
-1
votes
1
answer
123
views
Compare 2 columns in Polars and rearrange them when they match and unmatch?
I have a Polars DataFrame that has 2 columns [Col01 & Col02]. They hold the same values, though not the same number of times [e.g. Col01 can have say 5 rows of '00000' while Col02 may have 20 rows of '00000' ...
8
votes
1
answer
256
views
How to write a pandas-compatible, non-elementary expression in narwhals
I'm working with the narwhals package and I'm trying to write an expression that is:
applied over groups using .over()
Non-elementary/chained (longer than a single operation)
Works when the native df ...
-2
votes
1
answer
127
views
Polars scan_ndjson Out of memory
Description
Trying to read 32GB of data split across 16 .jsonl files.
I use the function scan_ndjson of Polars but the execution stops with error 137 (Out of memory).
Here is the code:
# Count infobox ...
3
votes
3
answers
159
views
Calculating monthly revenue given start and end date for each ID using Polars
I have a dataframe using this format
import polars as pl
df = pl.from_repr("""
┌─────┬────────────┬────────────┬──────────┐
│ ID ┆ DATE_PREV ┆ DATE ┆ REV_DIFF │
│ --- ┆ --- ...
2
votes
1
answer
87
views
polars-u64-idx not available for latest version
While the standard Polars package is available in version 1.34.0, the polars-u64-idx package is missing the latest versions.
Does anyone know if this package is discontinued?
2
votes
2
answers
237
views
How do I get polars.Expr.str.json_decode to decode simple map to List(Struct({'key': String, 'value': Int32}))?
json_decode requires that we specify the dtype.
Polars represents maps with arbitrary keys as a List<struct<2>> (see here).
EDIT: Suppose I don't know the keys in my JSON ahead of time, ...
2
votes
1
answer
123
views
How to perform sinking lazyframes with diverging queries to different partitions
I have a very big parquet file which I'm attempting to read from and split into partitioned folders on a column "token".
Currently I'm using pl.scan_parquet on the big parquet file followed ...
2
votes
3
answers
117
views
Forward fill using values from rows that match a condition in Polars
I have this dataframe:
import polars as pl
df = pl.DataFrame({'value': [1,2,3,4,5,None,None], 'flag': [0,1,1,1,0,0,0]})
┌───────┬──────┐
│ value ┆ flag │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═══════╪══...
2
votes
1
answer
68
views
How to select joined columns with structure like namespaces (a.col1, b.col2)?
I am working to migrate from PySpark to Polars. In PySpark I often use aliases on dataframes so I can clearly see which columns come from which side of a join. I'd like to get similarly readable code ...
0
votes
0
answers
113
views
Enabling Delta Table checkpointing when using polars write_delta()
I am using polars.df.write_delta() to initially create, and subsequently append to, Delta Tables in Microsoft Fabric OneLake storage, via a Fabric python notebook.
Having had a production process up ...
1
vote
1
answer
97
views
Converting a Rust `futures::TryStream` to a `polars::LazyFrame`
I have an application where I have a futures::TryStream. Still in a streaming fashion, I want to convert this into a polars::LazyFrame. It is important to note that the TryStream comes from the ...
0
votes
1
answer
117
views
PyCharm "view as DataFrame" shows nothing for polars DataFrames
Basically the title. Using PyCharm 2023.3.3 I'm not able to see the data of polars DataFrames.
As an example, I've a simple DataFrame like this:
print(ids_df)
shape: (1, 4)
┌───────────────────────────...
3
votes
3
answers
92
views
Dynamically index a column in Polars
I have a simple dataframe that looks like this:
import polars as pl
df = pl.DataFrame({
'ref': ['a', 'b', 'c', 'd', 'e', 'f'],
'idx': [4, 3, 1, 6, 2, 5],
})
How can I obtain the result as ...
2
votes
1
answer
104
views
Find nearest / closest value to subset of values in a Polars dataframe
I have this dataframe
import polars as pl
df = pl.from_repr("""
┌────────────┬──────┐
│ date ┆ ME │
│ --- ┆ --- │
│ date ┆ i64 │
╞════════════╪══════╡
│ 2027-11-...
3
votes
0
answers
66
views
How to repeat List in Polars [duplicate]
I am trying to repeat the values of a List in polars. The equivalent operation in pure python would be:
[1,2,3,4] * 3 -> [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4].
So the content of the list is repeated ...
0
votes
1
answer
96
views
How to extract & coalesce deeply nested values that may not exist? [closed]
I'm trying to extract some data from deeply nested JSON - this works:
lf.with_columns(
[
pl.coalesce(
[
pl.col("a"),
pl.col("...
0
votes
1
answer
117
views
Show progress bar when reading files with globbing with polars
I have a folder with multiple Excel files.
I'm reading all of them in a single polars DataFrame concatenated vertically using globbing:
import polars as pl
df = pl.read_excel("folder/*.xlsx")...
4
votes
2
answers
117
views
How to create a cross table with percentages in Polars?
I would like to create a cross table that shows, in each cell, the percentages of rows over the total number of rows.
Inspired by this post I started with:
df = pl.DataFrame({"a": [2, 0, 1, ...
3
votes
3
answers
183
views
Drop column by index in polars
I need to drop the first column in a polars DataFrame.
I tried:
result = df.select([col for idx, col in enumerate(df.columns) if idx != 0])
But it looks long and clumsy for such a simple task.
I also ...
1
vote
1
answer
121
views
group_by with polars concatenating values
I have a polars dataframe that I want to group by and concatenate the unique values into a single entry.
In pandas, I do:
def unique_colun_values(x):
return('|'.join(set(x)))
dd=pd.DataFrame({'...
4
votes
3
answers
121
views
How can I efficiently get both a column and a scalar using Polars expressions?
Polars suggests the usage of Expressions to avoid eager execution and then execute all expressions together at the very end.
I am unsure how this is possible if I want a column and a scalar. For ...
0
votes
4
answers
206
views
Recursively rename all column names and nested struct fields to lowercase in a Polars DataFrame? [closed]
Is there a way for Polars to rename all columns, not just at the top level, but including multiple levels of nested structs?
I need them to all be lowercase via str.lower
3
votes
1
answer
150
views
write_database(..., engine="adbc") with autocommit=False
In polars, I would like to use pl.write_database multiple times with engine="adbc" in the same session and then commit all at the end with conn.commit(), i.e. do a manual commit.
import ...
2
votes
1
answer
176
views
Memory efficient sorting/removing duplicates of polars dataframes
I am trying to import very large csv files into parquet files using polars. I stream data, use lazy dataframes and sinks. No problem until...
...sorting the dataframe on a column and removing ...