Slicing multiple chunks in a polars dataframe

Question

Consider the following dataframe.

import polars as pl

df = pl.DataFrame(data={"col1": range(10)})

┌──────┐
│ col1 │
│ ---  │
│ i64  │
╞══════╡
│ 0    │
│ 1    │
│ 2    │
│ 3    │
│ 4    │
│ 5    │
│ 6    │
│ 7    │
│ 8    │
│ 9    │
└──────┘

Let's say I have a list of tuples, where the first value represents the start index and the second value a length value (as used in pl.DataFrame.slice). This might look like this:

slices = [(1,2), (5,3)]

Now, what's a good way to extract two chunks from df, where the first chunk starts at row 1 and has a length of 2, and the second starts at row 5 and has a length of 3?

Here's what I am looking for:

┌──────┐
│ col1 │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 2    │
│ 5    │
│ 6    │
│ 7    │
└──────┘

Hericks · Accepted Answer · 2024-07-05 14:39:05Z

5

You could use pl.DataFrame.slice to obtain each slice separately and then use pl.concat to concatenate all slices.

pl.concat(df.slice(offset, length) for offset, length in slices)

shape: (5, 1)
┌──────┐
│ col1 │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 2    │
│ 5    │
│ 6    │
│ 7    │
└──────┘

Edit. As an attempt at a vectorized approach, you could first turn the list of slice parameters into a dataframe of row indices (using pl.int_ranges and pl.DataFrame.explode). This dataframe of indices can then be joined against df to select the matching rows.

indices = (
    pl.DataFrame(slices, orient="row", schema=["offset", "length"])
    .select(
        index=pl.int_ranges("offset", pl.col("offset") + pl.col("length"))
    )
    .explode("index")
)

shape: (5, 1)
┌───────┐
│ index │
│ ---   │
│ i64   │
╞═══════╡
│ 1     │
│ 2     │
│ 5     │
│ 6     │
│ 7     │
└───────┘
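To make it clear what the int_ranges plus explode step computes, here is the same expansion in plain Python. This is only an illustration of the logic, not how polars implements it: each (offset, length) pair becomes the half-open range from offset to offset + length, and the ranges are flattened into one list.

```python
from itertools import chain

slices = [(1, 2), (5, 3)]  # (offset, length) pairs from the question

# Expand each pair into range(offset, offset + length), then flatten.
index = list(chain.from_iterable(range(o, o + n) for o, n in slices))
print(index)  # [1, 2, 5, 6, 7]
```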

(
    indices
    .join(
        df,
        left_on="index",
        right_on=pl.int_range(pl.len()),
        how="left",
        coalesce=True,
    )
    .drop("index")
)

shape: (5, 1)
┌──────┐
│ col1 │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 2    │
│ 5    │
│ 6    │
│ 7    │
└──────┘

edited Jul 5, 2024 at 14:39

answered Jul 5, 2024 at 13:25


3 Comments

Andi Over a year ago

This is working. However, I am wondering if there exists a vectorized solution? Can I avoid the for-loop?

Hericks Over a year ago

@Andi I've added an attempt at a vectorized solution in my latest edit. I am not 100% sure whether this avoids the loop over the list of slice parameters though (as the list still needs to be parsed when creating the dataframe of indices). I would be interested to hear whether this leads to an increase in performance on your data.

Andi Over a year ago

The one-liner looks much more concise ;-) Anyway, thanks for showing an alternative.
