removing rows that don't fit the repeating sequence in pandas dataframe

Question

I have a pandas dataframe that looks like this:

    A   B   C   D
0   1   2   3   0
1   4   5   6   1
2   7   8   9   2
3   10  10  10  0
4   10  10  10  1
5   1   2   3   0
6   4   5   6   1
7   7   8   8   2

I would like to remove every group of rows whose values in column 'D' do not form the sequence 0, 1, 2, in that exact order.

The new dataframe I would like to obtain should look like this:

    A   B   C   D
0   1   2   3   0
1   4   5   6   1
2   7   8   9   2
3   1   2   3   0
4   4   5   6   1
5   7   8   8   2

Rows 3 and 4 are dropped because the row that follows them (row 5) does not have 2 in column 'D', so that 0, 1 run is never completed.

What have you tried? I posted an answer with a partial solution, but if you've already come up with a better algorithm, I could fill in the gaps.
– wjandrea, Commented Aug 21 at 22:16
It'd also help to show what research you've done, so we can follow keywords that helped and rule out ones that didn't. For reference, see How to make good reproducible pandas examples; you're already good on most points in the top answer, just missing "show the code you've tried" and "show you've done some research".
– wjandrea, Commented Aug 21 at 22:17
Thanks for your ideas. I have tried looping through the rows with .itertuples() and building some logic that removes every set of rows that does not follow that order.
– AjWinston, Commented Aug 21 at 22:22

wjandrea · Accepted Answer · 2025-08-22 00:55:28Z

4

A possible solution based on numpy:

import numpy as np

w = np.lib.stride_tricks.sliding_window_view(df['D'], 3)
idx = np.flatnonzero((w == (0,1,2)).all(1)) # starting indexes of seq 0, 1, 2
df.iloc[(idx[:, None] + np.arange(3)).ravel()].reset_index(drop=True)

This uses numpy's sliding_window_view to create a rolling 3-element view over the D column, then checks which windows match the sequence (0,1,2) by comparing element-wise and applying all along axis 1; flatnonzero returns the indices of the matching windows. Broadcasting expands these starting indices into full triplets, iloc selects the corresponding rows from the dataframe, and reset_index renumbers the result.

Output:

   A  B  C  D
0  1  2  3  0
1  4  5  6  1
2  7  8  9  2
3  1  2  3  0
4  4  5  6  1
5  7  8  8  2

Intermediates:

# w == (0,1,2)

array([[ True,  True,  True],
       [False, False, False],
       [False, False, False],
       [ True,  True, False],
       [False, False, False],
       [ True,  True,  True]])

# idx[:, None]

array([[0],
       [5]])

# + np.arange(3)

array([[0, 1, 2],
       [5, 6, 7]])

# .ravel()

array([0, 1, 2, 5, 6, 7])

To make this solution more general, define:

seq = (0,1,2)
n = len(seq)

then:

.sliding_window_view(..., n)
w == seq
np.arange(n)

(thanks @wjandrea)
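Putting the generalized pieces together, here is a minimal sketch of the same idea as one function. The name keep_matching_runs, the short-input guard, and the np.unique call (so that overlapping matches do not select a row twice) are my own additions for illustration and are not part of the answer above.

```python
import numpy as np
import pandas as pd

def keep_matching_runs(df, column, seq):
    """Keep only rows whose `column` values form a complete run of `seq`.

    Hypothetical helper: the name and the guards are assumptions.
    """
    seq = np.asarray(seq)
    n = len(seq)
    values = df[column].to_numpy()
    if len(values) < n:
        # sliding_window_view raises if the window is longer than the data
        return df.iloc[0:0].reset_index(drop=True)
    windows = np.lib.stride_tricks.sliding_window_view(values, n)
    # Starting positions of every window equal to seq
    starts = np.flatnonzero((windows == seq).all(axis=1))
    # Expand each start into n consecutive positions; np.unique flattens,
    # sorts, and drops duplicates caused by overlapping matches
    rows = np.unique(starts[:, None] + np.arange(n))
    return df.iloc[rows].reset_index(drop=True)

df = pd.DataFrame({
    'A': [1, 4, 7, 10, 10, 1, 4, 7],
    'B': [2, 5, 8, 10, 10, 2, 5, 8],
    'C': [3, 6, 9, 10, 10, 3, 6, 8],
    'D': [0, 1, 2, 0, 1, 0, 1, 2]})
result = keep_matching_runs(df, 'D', (0, 1, 2))
```

With the question's data, result holds the six rows of the two complete 0, 1, 2 runs.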

edited Aug 22 at 0:55

wjandrea


answered Aug 21 at 23:38

PaulS


3 Comments

wjandrea Aug 22 at 0:21

This is really elegant! It's a shame the NumPy syntax is so ugly :p

PaulS Aug 22 at 0:33

True, thanks! Another approach that works is convolution, but I thought that would be much more complicated to explain, as one has to make sure the score of the convolution does identify the sequence 0, 1, 2 uniquely.

wjandrea Aug 22 at 1:52

I posted a more Pandas-idiomatic version of this here. Thanks for the inspiration!

wjandrea · Accepted Answer · 2025-08-21 22:11:04Z

1

For what it's worth, this solution works in this case, but not more generally:

You can check that rows in D fit the sequence by shifting:

mask = df['D'].pipe(lambda s:
    (s.eq(0) & s.shift(-1).eq(1) & s.shift(-2).eq(2)) |
    (s.eq(1) & s.shift(-1).eq(2)) |
    s.eq(2)
)
df[mask]

   A  B  C  D
0  1  2  3  0
1  4  5  6  1
2  7  8  9  2
5  1  2  3  0
6  4  5  6  1
7  7  8  8  2

Where it breaks

I think there are multiple ways, but the first I thought of is adding another row with a 2 at the end:

8   10  10  10  2

This code doesn't remove it, since | s.eq(2) isn't strict enough.

    A   B   C  D
0   1   2   3  0
1   4   5   6  1
2   7   8   9  2
5   1   2   3  0
6   4   5   6  1
7   7   8   8  2
8  10  10  10  2
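One way to tighten the mask, sketched here as an assumption rather than as part of the original answer, is to make every branch check the whole window from its own position, so a 1 or a 2 is kept only when the values before it also fit. The example uses only the D column, with the extra trailing 2 appended.

```python
import pandas as pd

# Column D from the question, plus the extra trailing 2 that breaks the mask
d = pd.Series([0, 1, 2, 0, 1, 0, 1, 2, 2], name="D")

# Each branch checks the full 0, 1, 2 window relative to the current row.
# shift() introduces NaN at the edges, and NaN never compares equal.
strict = (
    (d.eq(0) & d.shift(-1).eq(1) & d.shift(-2).eq(2))    # row is the 0
    | (d.shift(1).eq(0) & d.eq(1) & d.shift(-1).eq(2))   # row is the 1
    | (d.shift(2).eq(0) & d.shift(1).eq(1) & d.eq(2))    # row is the 2
)
kept = d[strict].index.tolist()
print(kept)
```

This still hard-codes the three positions, so it does not scale to longer sequences.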

answered Aug 21 at 22:11

wjandrea


1 Comment

wjandrea Aug 22 at 1:45

I posted a better answer here

wjandrea · Accepted Answer · 2025-08-22 12:57:50Z

1

.rolling, check if sequence matches

The expected sequence is 0, 1, 2, so you can create a rolling window of that size then check if each window matches. That gives you the ending points, and you just need to backfill to the size of the window.

seq = (0, 1, 2)
n = len(seq)
mask = (
    df['D'].rolling(n)
    .apply(lambda s: s.eq(seq).all())
    .astype('boolean')  # Nullable boolean, to allow backfilling
    .where(lambda s: s)  # False -> NA
    .bfill(limit=n-1)
    .fillna(False)  # NA -> False
)
df[mask]

   A  B  C  D
0  1  2  3  0
1  4  5  6  1
2  7  8  9  2
5  1  2  3  0
6  4  5  6  1
7  7  8  8  2

This is based on PaulS's answer, but using Pandas's fluent interface, and using bfill instead of broadcasting with arange(n).

If you need to optimize

Start by switching to NumPy in the apply:

.apply(lambda s: (s == seq).all(), raw=True)

For a df of 60,000 rows, this took it from 22s to 1s on my machine.
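To check the effect on your own machine, a rough timing sketch such as the following may help. The random test series, its size, and the run helper are assumptions made for illustration; the float(...) wrapper is a precaution of mine so the callback returns a plain float, and absolute timings will differ from the figures quoted above.

```python
import time

import numpy as np
import pandas as pd

seq = (0, 1, 2)
n = len(seq)

# Hypothetical test data: 2,000 random values drawn from 0, 1, 2
rng = np.random.default_rng(0)
s = pd.Series(rng.integers(0, 3, size=2_000))

def run(raw):
    """Apply the window check with or without raw=True and time it."""
    if raw:
        func = lambda a: float((a == seq).all())   # a is a NumPy array
    else:
        func = lambda w: float(w.eq(seq).all())    # w is a pandas Series
    start = time.perf_counter()
    out = s.rolling(n).apply(func, raw=raw)
    return out, time.perf_counter() - start

slow, t_slow = run(raw=False)
fast, t_fast = run(raw=True)

# Both variants should flag exactly the same window end points
print(slow.equals(fast), f"raw=False: {t_slow:.3f}s, raw=True: {t_fast:.3f}s")
```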

edited Aug 22 at 12:57

answered Aug 22 at 1:37

wjandrea


1 Comment

PaulS Aug 22 at 10:14

Nice solution and based exclusively on pandas!

wjandrea · Accepted Answer · 2025-08-22 13:10:24Z

You can check .shift(-n) == n for 0, 1, 2 to identify the start of each run and then shift forward to mark the remaining rows.

size = 3

mask = True
for n in range(size):
    mask &= df["D"].shift(-n) == n

for n in range(1, size):
    mask |= mask.shift()

df[mask]

   A  B  C  D
0  1  2  3  0
1  4  5  6  1
2  7  8  9  2
5  1  2  3  0
6  4  5  6  1
7  7  8  8  2

The first for loop leaves True only on the row where each complete 0, 1, 2 run starts (the 0 row).

>>> mask_df = pd.concat([df["D"].shift(-n) == n for n in range(size)], axis=1)
>>> mask_df 
       D      D      D
0   True   True   True  # <-
1  False  False  False
2  False  False  False
3   True   True  False
4  False  False  False
5   True   True   True  # <-
6  False  False  False
7  False  False  False

>>> mask1 = mask_df.all(axis=1)
>>> mask1
0     True
1    False
2    False
3    False
4    False
5     True
6    False
7    False
dtype: bool

The second for loop shifts those True values forward so they also cover the 1 and 2 rows of each run.

>>> mask2 = pd.concat([mask1.shift(n) for n in range(size)], axis=1).any(axis=1)
>>> mask2
0     True
1     True
2     True
3    False
4    False
5     True
6     True
7     True
dtype: bool
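The two loops compare against n itself, which works because the sequence here happens to be 0, 1, 2. A possible generalization to an arbitrary sequence, sketched under the assumption that a helper like shift_sequence_mask (a name I made up) is acceptable, enumerates the sequence instead:

```python
import pandas as pd

def shift_sequence_mask(s, seq):
    """Boolean mask of rows of `s` that belong to a complete run of `seq`.

    Hypothetical helper, not part of the original answer.
    """
    # True where a complete run starts: the value at each offset must match
    starts = pd.Series(True, index=s.index)
    for offset, value in enumerate(seq):
        starts &= s.shift(-offset).eq(value)
    # Spread each start forward over the rest of its run;
    # fill_value=False keeps the dtype boolean
    mask = starts.copy()
    for offset in range(1, len(seq)):
        mask |= starts.shift(offset, fill_value=False)
    return mask

d = pd.Series([1, 5, 4, 0, 1, 5, 1, 5, 4])
mask = shift_sequence_mask(d, (1, 5, 4))
print(d[mask].index.tolist())
```

Here the incomplete 1, 5 at positions 4 and 5 is dropped, while both complete 1, 5, 4 runs are kept.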

Setup

df = pd.DataFrame({
    'A': [1, 4, 7, 10, 10, 1, 4, 7],
    'B': [2, 5, 8, 10, 10, 2, 5, 8],
    'C': [3, 6, 9, 10, 10, 3, 6, 8],
    'D': [0, 1, 2, 0, 1, 0, 1, 2]})

wjandrea · Accepted Answer · 2025-08-23 13:36:16Z

0

Build a boolean mask that marks the rows belonging to a matching window, then apply it to filter the dataframe:

# pattern to keep
pattern = [0, 1, 2]
pattern_length = len(pattern)

# Create a mask to identify valid sequences
mask = pd.Series(False, index=df.index)

for i in range(len(df) - pattern_length + 1):
    # Check if the current window matches our pattern
    if all(df['D'].iloc[i + j] == pattern[j] for j in range(pattern_length)):
        mask.iloc[i : i + pattern_length] = True

# Apply the mask to filter the dataframe
result = df[mask].reset_index(drop=True)

print(result)

edited Aug 23 at 13:36

wjandrea


answered Aug 23 at 12:15

Mohsen


2 Comments

wjandrea Aug 23 at 13:41

You don't need to loop over range(pattern_length), just use slicing: df['D'].iloc[i : i+pattern_length].eq(pattern).all(). I haven't tested this; you might need to convert pattern to tuple for it to work.

wjandrea Aug 23 at 13:43

This is quite similar to my answer that uses .rolling, but using more vanilla Python idioms.

Triky · Accepted Answer · 2025-08-23 15:38:25Z

0

Since the order is fixed and nothing is randomized, we can use groupby to build groups according to your specification and check whether each group is equal to your order.

First, we create groups that start at the first value of the order (D == 0 in this case) or at any value greater than the last value of the order (D > 2), so that the numbers we don't need won't affect the next step.

After that, we group by those labels, convert every group to a tuple, and check whether each tuple is equal to your order.

Finally, we discard the groups that are not equal to the order and build a mask from the ones that are, which we use to filter the dataframe.

order = (0, 1, 2)
g = (df['D'].eq(0) | df['D'].gt(2)).cumsum()
m = df['D'].groupby(g).apply(tuple) == order
result = df[g.isin(m[m].index)]

End result:

   A  B  C  D
0  1  2  3  0
1  4  5  6  1
2  7  8  9  2
5  1  2  3  0
6  4  5  6  1
7  7  8  8  2

Note: This solution won't work for sequences that contain numbers larger than the numbers outside the sequence.

Example: the sequence 1, 5, 4 won't work if the values outside it include numbers smaller than those in the sequence (such as 0, 2, 3).

It also won't work for sequences that contain duplicate numbers. Example: 0, 0, 1, 2, 3

If that is the case, your only option is to use the Pandas or NumPy rolling-window approaches.
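As a small illustration of these limits, the sketch below runs the grouping on made-up data where a stray 2 follows a complete run, and compares it with a plain window match. The data are assumptions, and the tuple comparison is moved inside a lambda (my variation on the code above) so each group yields a simple boolean.

```python
import numpy as np
import pandas as pd

order = (0, 1, 2)
# Made-up data: a complete run followed by a stray 2, then another run
d = pd.Series([0, 1, 2, 2, 0, 1, 2], name="D")

# Grouping as in the answer: a new group starts at the first value of the
# order or at any value greater than its last value
g = (d.eq(order[0]) | d.gt(order[-1])).cumsum()
# Variation: compare inside the lambda so each group yields one bool
is_match = d.groupby(g).apply(lambda grp: tuple(grp) == order)
groupby_rows = d[g.isin(is_match[is_match].index)].index.tolist()

# Window match for comparison
w = np.lib.stride_tricks.sliding_window_view(d.to_numpy(), len(order))
starts = np.flatnonzero((w == order).all(axis=1))
window_rows = (starts[:, None] + np.arange(len(order))).ravel().tolist()

print(groupby_rows)  # the stray 2 joins the first group, so that run is lost
print(window_rows)
```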

edited Aug 23 at 15:38

answered Aug 22 at 14:48

Triky


3 Comments

wjandrea Aug 23 at 13:54

Why are you using set? That ignores duplicates and order, so for example, 0 2 2 1 is allowed, but shouldn't be.

wjandrea Aug 23 at 14:13

This also allows 0 1 2 2, but from what I understood, only the first part, 0 1 2 should be selected.

Triky Aug 23 at 15:30

You're right. I forgot to mention that there are cases that my code can't handle. Thank you for pointing that out.
