Consider the following dataframe.

import polars as pl

df = pl.DataFrame(data={"col1": range(10)})
┌──────┐
│ col1 │
│ ---  │
│ i64  │
╞══════╡
│ 0    │
│ 1    │
│ 2    │
│ 3    │
│ 4    │
│ 5    │
│ 6    │
│ 7    │
│ 8    │
│ 9    │
└──────┘

Let's say I have a list of tuples, where the first value of each tuple is the start index (offset) and the second is the length, as used in pl.DataFrame.slice. This might look like this:

slices = [(1,2), (5,3)]

Now, what's a good way to extract two chunks from df, where the first chunk starts at row 1 and has a length of 2, and the second starts at row 5 and has a length of 3?

Here's what I am looking for:

┌──────┐
│ col1 │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 2    │
│ 5    │
│ 6    │
│ 7    │
└──────┘

1 Answer

You could use pl.DataFrame.slice to obtain each slice separately and then use pl.concat to concatenate all slices.

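# Unpack each (offset, length) tuple explicitly; this also avoids shadowing the built-in `slice`.
pl.concat(df.slice(offset, length) for offset, length in slices)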
shape: (5, 1)
┌──────┐
│ col1 │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 2    │
│ 5    │
│ 6    │
│ 7    │
└──────┘

Edit. As an attempt at a vectorized approach, you could first turn the list of slice parameters into a dataframe of row indices (using pl.int_ranges and pl.DataFrame.explode). That dataframe of indices can then be joined against df to pick out the matching rows.

indices = (
    pl.DataFrame(slices, orient="row", schema=["offset", "length"])
    .select(
        index=pl.int_ranges("offset", pl.col("offset") + pl.col("length"))
    )
    .explode("index")
)
shape: (5, 1)
┌───────┐
│ index │
│ ---   │
│ i64   │
╞═══════╡
│ 1     │
│ 2     │
│ 5     │
│ 6     │
│ 7     │
└───────┘
(
    indices
    .join(
        df,
        left_on="index",
        right_on=pl.int_range(pl.len()),
        how="left",
        coalesce=True,
    )
    .drop("index")
)
shape: (5, 1)
┌──────┐
│ col1 │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 2    │
│ 5    │
│ 6    │
│ 7    │
└──────┘
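
To make the index-building step easier to follow, here is the same expansion in plain Python, without Polars. It is only an illustration of what pl.int_ranges followed by explode computes for these inputs; the function name expand_slices is hypothetical.

```python
from itertools import chain


def expand_slices(slices):
    """Expand (offset, length) pairs into a flat list of row indices."""
    return list(
        chain.from_iterable(
            range(offset, offset + length) for offset, length in slices
        )
    )


slices = [(1, 2), (5, 3)]
rows = expand_slices(slices)
print(rows)  # [1, 2, 5, 6, 7]

# A slice that runs past the end of a 10-row frame still yields indices 10..12.
past_end = expand_slices([(8, 5)])
print(past_end)  # [8, 9, 10, 11, 12]
```

One difference that is probably worth checking on your data: indices past the last row have no match in df, so with how="left" I would expect them to come back as null rows, whereas df.slice simply stops at the end of the frame.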

3 Comments

This is working. However, I am wondering if there exists a vectorized solution? Can I avoid the for-loop?
@Andi I've added an attempt at a vectorized solution in my latest edit. I am not 100% sure whether this avoids the loop over the list of slice parameters though (as the list needs to be parsed when creating the dataframe of indices). I would be interested to hear whether this leads to an increase in performance on your data.
The one-liner looks much more concise ;-) Anyway, thanks for showing an alternative.
