3

I have a pandas dataframe that looks like this:

    A   B   C   D
0   1   2   3   0
1   4   5   6   1
2   7   8   9   2
3   10  10  10  0
4   10  10  10  1
5   1   2   3   0
6   4   5   6   1
7   7   8   8   2

I would like to remove every group of rows whose values in column 'D' do not form the sequence 0, 1, 2, in that exact order.

The new dataframe I would like to obtain should look like this:

    A   B   C   D
0   1   2   3   0
1   4   5   6   1
2   7   8   9   2
3   1   2   3   0
4   4   5   6   1
5   7   8   8   2

Rows 3 and 4 are dropped because the row that follows them (row 5) does not have 2 in column 'D'.

3
  • What have you tried? I posted an answer with a partial solution, but if you've already come up with a better algorithm, I could fill in the gaps. Commented Aug 21 at 22:16
  • It'd also help to show what research you've done, so we can follow keywords that helped and rule out ones that didn't. For reference, see How to make good reproducible pandas examples; you're already good on most points in the top answer, just missing "show the code you've tried" and "show you've done some research". Commented Aug 21 at 22:17
  • thanks for your ideas, I have tried to loop through the rows with .itertuples() and create some sort of logic that allows to remove all the set of rows that do not follow that order. Commented Aug 21 at 22:22

6 Answers

4

A possible solution based on numpy:

import numpy as np

w = np.lib.stride_tricks.sliding_window_view(df['D'], 3)
idx = np.flatnonzero((w == (0,1,2)).all(1)) # starting indexes of seq 0, 1, 2
df.iloc[(idx[:, None] + np.arange(3)).ravel()].reset_index(drop=True)

This uses NumPy's sliding_window_view to create a rolling 3-element view over the D column. Each window is compared element-wise with (0,1,2), and all along axis 1 flags the windows that match in full; flatnonzero then returns the starting positions of those windows. Broadcasting against np.arange(3) expands each start into a full triplet of positions, iloc selects the corresponding rows, and reset_index renumbers the result.

Output:

   A  B  C  D
0  1  2  3  0
1  4  5  6  1
2  7  8  9  2
3  1  2  3  0
4  4  5  6  1
5  7  8  8  2

Intermediates:

# w == (0,1,2)

array([[ True,  True,  True],
       [False, False, False],
       [False, False, False],
       [ True,  True, False],
       [False, False, False],
       [ True,  True,  True]])

# idx[:, None]

array([[0],
       [5]])

# + np.arange(3)

array([[0, 1, 2],
       [5, 6, 7]])

# .ravel()

array([0, 1, 2, 5, 6, 7])

To make this solution more general, define:

seq = (0,1,2)
n = len(seq)

then replace the hard-coded values with:

  • .sliding_window_view(..., n)
  • w == seq
  • np.arange(n)

(thanks @wjandrea)
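Putting the three substitutions together, the generalized version might look like the sketch below. The function name keep_sequence and its parameters are made up for illustration, and the np.unique call is an addition that is not in the answer: it drops duplicate positions if matches overlap (which cannot happen for 0, 1, 2 but can for a sequence like 0, 0).

```python
import numpy as np
import pandas as pd


def keep_sequence(df, column, seq):
    """Keep only the rows that belong to a complete run of `seq` in `column`.

    Hypothetical helper wrapping the answer's three lines.
    """
    seq = np.asarray(seq)
    n = len(seq)
    values = df[column].to_numpy()
    if len(values) < n:
        # sliding_window_view raises if the window is longer than the data
        return df.iloc[0:0]
    windows = np.lib.stride_tricks.sliding_window_view(values, n)
    # starting positions of every window equal to seq
    starts = np.flatnonzero((windows == seq).all(axis=1))
    # expand each start into n positions; np.unique flattens, sorts and
    # removes positions shared by overlapping matches (my addition)
    rows = np.unique(starts[:, None] + np.arange(n))
    return df.iloc[rows].reset_index(drop=True)


df = pd.DataFrame({
    'A': [1, 4, 7, 10, 10, 1, 4, 7],
    'B': [2, 5, 8, 10, 10, 2, 5, 8],
    'C': [3, 6, 9, 10, 10, 3, 6, 8],
    'D': [0, 1, 2, 0, 1, 0, 1, 2]})

result = keep_sequence(df, 'D', (0, 1, 2))
print(result)
```

On the question's data this selects the same six rows as the hard-coded version.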


3 Comments

This is really elegant! It's a shame the NumPy syntax is so ugly :p
True, thanks! Another approach that works is convolution, but I thought that would be much more complicated to explain, as one has to make sure the score of the convolution identifies the sequence 0, 1, 2 uniquely.
I posted a more Pandas-idiomatic version of this here. Thanks for the inspiration!
1

For what it's worth, this solution works in this case, but not more generally:

You can check that rows in D fit the sequence by shifting:

mask = df['D'].pipe(lambda s:
    (s.eq(0) & s.shift(-1).eq(1) & s.shift(-2).eq(2)) |
    (s.eq(1) & s.shift(-1).eq(2)) |
    s.eq(2)
)
df[mask]
   A  B  C  D
0  1  2  3  0
1  4  5  6  1
2  7  8  9  2
5  1  2  3  0
6  4  5  6  1
7  7  8  8  2

Where it breaks

I think there are multiple ways, but the first I thought of is adding another row with a 2 at the end:

8   10  10  10  2

This code doesn't remove it, since | s.eq(2) isn't strict enough.

    A   B   C  D
0   1   2   3  0
1   4   5   6  1
2   7   8   9  2
5   1   2   3  0
6   4   5   6  1
7   7   8   8  2
8  10  10  10  2
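To reproduce the failure, here is a small self-contained sketch. The helper name shift_mask_012 is made up, and the second check (a 1, 2 pair with no 0 before it) is an extra case I added; it is not from the example above.

```python
import pandas as pd


def shift_mask_012(s):
    """The shift-based mask from this answer, wrapped in a hypothetical helper."""
    return (
        (s.eq(0) & s.shift(-1).eq(1) & s.shift(-2).eq(2)) |
        (s.eq(1) & s.shift(-1).eq(2)) |
        s.eq(2)
    )


# The question's column D with one extra trailing 2 (row 8)
d_extra = pd.Series([0, 1, 2, 0, 1, 0, 1, 2, 2], name='D')
kept = d_extra[shift_mask_012(d_extra)]
print(kept.index.tolist())  # row 8 survives although it is not part of 0, 1, 2

# Extra case (my assumption of another failure mode): 1, 2 without a leading 0
d_no_zero = pd.Series([3, 1, 2], name='D')
print(shift_mask_012(d_no_zero).tolist())  # the 1 and the 2 are both kept
```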

1 Comment

I posted a better answer here
1

.rolling, check if sequence matches

The expected sequence is 0, 1, 2, so you can create a rolling window of that size then check if each window matches. That gives you the ending points, and you just need to backfill to the size of the window.

seq = (0, 1, 2)
n = len(seq)
mask = (
    df['D'].rolling(n)
    .apply(lambda s: s.eq(seq).all())
    .astype('boolean')  # Nullable boolean, to allow backfilling
    .where(lambda s: s)  # False -> NA
    .bfill(limit=n-1)
    .fillna(False)  # NA -> False
)
df[mask]
   A  B  C  D
0  1  2  3  0
1  4  5  6  1
2  7  8  9  2
5  1  2  3  0
6  4  5  6  1
7  7  8  8  2

This is based on PaulS's answer, but using Pandas's fluent interface, and using bfill instead of broadcasting with arange(n).

If you need to optimize

Start by switching to NumPy in the apply:

.apply(lambda s: (s == seq).all(), raw=True)

For a df of 60,000 rows, this took it from 22s to 1s on my machine.
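For reference, here is a sketch of a full mask built around the raw=True call. It is a variant rather than the exact chain above: the function name sequence_mask is made up, the result is wrapped in float() so the callback returns a plain number, and the window ends are spread backwards with shift instead of the nullable-boolean bfill. I have not timed it.

```python
import numpy as np
import pandas as pd


def sequence_mask(s, seq):
    """Boolean mask of the rows in `s` that belong to a complete run of `seq`.

    Hypothetical helper; a variant of the rolling approach in this answer.
    """
    seq = np.asarray(seq)
    n = len(seq)
    # raw=True passes each window as a NumPy array instead of a Series
    ends = (
        s.rolling(n)
        .apply(lambda w: float((w == seq).all()), raw=True)
        .eq(1)  # the first n-1 windows are incomplete (NaN) and become False
    )
    mask = ends.copy()
    for k in range(1, n):
        # also mark the n-1 rows that precede each matching window end
        mask |= ends.shift(-k, fill_value=False)
    return mask


df = pd.DataFrame({
    'A': [1, 4, 7, 10, 10, 1, 4, 7],
    'B': [2, 5, 8, 10, 10, 2, 5, 8],
    'C': [3, 6, 9, 10, 10, 3, 6, 8],
    'D': [0, 1, 2, 0, 1, 0, 1, 2]})

mask = sequence_mask(df['D'], (0, 1, 2))
print(df[mask])
```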

1 Comment

Nice solution and based exclusively on pandas!
1

You can check .shift(-n) == n for 0, 1, 2 to identify the start of each run and then shift forward to mark the remaining rows.

size = 3

mask = True
for n in range(size):
    mask &= df["D"].shift(-n) == n

for n in range(1, size):
    mask |= mask.shift()

df[mask]
   A  B  C  D
0  1  2  3  0
1  4  5  6  1
2  7  8  9  2
5  1  2  3  0
6  4  5  6  1
7  7  8  8  2

The first for loop will leave you with True for each 0 row in 0, 1, 2.

>>> mask_df = pd.concat([df["D"].shift(-n) == n for n in range(size)], axis=1)
>>> mask_df 
       D      D      D
0   True   True   True  # <-
1  False  False  False
2  False  False  False
3   True   True  False
4  False  False  False
5   True   True   True  # <-
6  False  False  False
7  False  False  False
>>> mask1 = mask_df.all(axis=1)
>>> mask1
0     True
1    False
2    False
3    False
4    False
5     True
6    False
7    False
dtype: bool

The second for loop shifts the True values forward for 1, 2.

>>> mask2 = pd.concat([mask1.shift(n) for n in range(size)], axis=1).any(axis=1)
>>> mask2
0     True
1     True
2     True
3    False
4    False
5     True
6     True
7     True
dtype: bool

Setup

df = pd.DataFrame({
    'A': [1, 4, 7, 10, 10, 1, 4, 7],
    'B': [2, 5, 8, 10, 10, 2, 5, 8],
    'C': [3, 6, 9, 10, 10, 3, 6, 8],
    'D': [0, 1, 2, 0, 1, 0, 1, 2]})
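The loops above rely on the sequence being 0, 1, 2, because they compare against the loop counter n. A possible generalization to an arbitrary sequence is sketched below; the name shift_mask is made up, and fill_value=False is my choice to keep the shifted masks boolean.

```python
import pandas as pd


def shift_mask(s, seq):
    """Mask of the rows in `s` that belong to a complete run of `seq`.

    Hypothetical helper generalizing the two loops in this answer.
    """
    # True where a full run of seq starts
    starts = pd.Series(True, index=s.index)
    for offset, value in enumerate(seq):
        starts &= s.shift(-offset).eq(value)

    # spread each start forward over the remaining len(seq) - 1 rows
    mask = starts.copy()
    for offset in range(1, len(seq)):
        mask |= starts.shift(offset, fill_value=False)
    return mask


d = pd.Series([0, 1, 2, 0, 1, 0, 1, 2], name='D')
print(shift_mask(d, (0, 1, 2)).tolist())

# A sequence that is not 0..n-1 and whose matches overlap
other = pd.Series([5, 3, 5, 3, 5])
print(shift_mask(other, (5, 3, 5)).tolist())
```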

Comments

0

Build a boolean mask that marks every row belonging to a matching sequence, then use it to filter the dataframe:

import pandas as pd

# Pattern to keep; rows outside a full match are dropped
pattern = [0, 1, 2]
pattern_length = len(pattern)

# Create a mask to identify valid sequences
mask = pd.Series(False, index=df.index)

for i in range(len(df) - pattern_length + 1):
    # Check if the current window matches our pattern
    if all(df['D'].iloc[i + j] == pattern[j] for j in range(pattern_length)):
        mask.iloc[i : i + pattern_length] = True

# Apply the mask to filter the dataframe
result = df[mask].reset_index(drop=True)

print(result)

2 Comments

You don't need to loop over range(pattern_length), just use slicing: df['D'].iloc[i : i+pattern_length].eq(pattern).all(). I haven't tested this; you might need to convert pattern to tuple for it to work.
This is quite similar to my answer that uses .rolling, but using more vanilla Python idioms.
0

Since the order is fixed, we can use groupby to split column D into groups and check whether each group equals your order.

First, we build the group key: a new group starts at every row where D equals the first value of your order (D == 0 in this case) or is greater than its last value (D > 2), so that the numbers we don't need won't affect the next step.

Next, we group by that key, convert every group to a tuple, and check whether each tuple equals your order.

Finally, we discard the groups that do not match and use the labels of the matching groups as a mask to filter your dataframe.

order = (0, 1, 2)
g = (df['D'].eq(0) | df['D'].gt(2)).cumsum()
m = df['D'].groupby(g).apply(tuple) == order
result = df[g.isin(m[m].index)]

End result:

 A  B  C  D
 1  2  3  0
 4  5  6  1
 7  8  9  2
 1  2  3  0
 4  5  6  1
 7  8  8  2

Note: This solution won't work when your sequence contains numbers that are bigger than the numbers outside of it.

Example: Sequence 1, 5, 4 won't work if the values outside of it (0, 2, 3) are smaller than numbers in your sequence.

It will also not work for sequences that contain duplicate numbers. Example: 0, 0, 1, 2, 3

If that is the case, use one of the Pandas or NumPy rolling-window approaches instead.
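A related edge case can be seen on a small made-up input: a stray 2 right after a valid run lands in the same group, so the whole group is rejected, valid rows included. The sketch below follows the three lines of this answer, except that the tuple comparison is spelled out with map (my substitution, to make the per-group check explicit).

```python
import pandas as pd

order = (0, 1, 2)
# Made-up data: a valid run, a stray 2, then another valid run
d = pd.Series([0, 1, 2, 2, 0, 1, 2], name='D')

# New group at every 0 (or at any value greater than 2)
g = (d.eq(0) | d.gt(2)).cumsum()
# One tuple per group, compared with the expected order
m = d.groupby(g).apply(tuple).map(lambda t: t == order)
kept = d[g.isin(m[m].index)]

# Group 1 is (0, 1, 2, 2), so rows 0-2 are dropped along with the stray 2
print(kept.index.tolist())
```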

3 Comments

Why are you using set? That ignores duplicates and order, so for example, 0 2 2 1 is allowed, but shouldn't be.
This also allows 0 1 2 2, but from what I understood, only the first part, 0 1 2 should be selected.
You're right. I forgot to mention that there are cases that my code can't handle. Thank you for pointing that out.
