
I have a DataFrame with a column Digit of base-10 digits. For example:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Digit": [
        1, 3, 5, 7, 0, 0, 0,
        4, 8, 9, 7, 7, 7, 7, 
        9, 3, 3, 1, 6, 8, 0,
        8, 8, 8, 8, 8, 3, 1,
    ]
})
    Digit
0       1
1       3
2       5
3       7
4       0
5       0
6       0
7       4
8       8
9       9
10      7
11      7
12      7
13      7
14      9
15      3
16      3
17      1
18      6
19      8
20      0
21      8
22      8
23      8
24      8
25      8
26      3
27      1

I want to identify the index of the first element of every run of at least k consecutive identical digits. So, for this df and k=2, I want to obtain the indices [4, 10, 15, 21]; for k=3, it should be [4, 10, 21]; and so on.
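For concreteness, here is a naive pure-Python baseline (a correctness reference only, far too slow for my real data):

```python
def first_indices_of_runs(digits, k):
    """Index of the first element of each run of >= k identical digits (naive scan)."""
    out, i, n = [], 0, len(digits)
    while i < n:
        j = i
        # advance j to the end of the current run
        while j < n and digits[j] == digits[i]:
            j += 1
        if j - i >= k:
            out.append(i)
        i = j
    return out

digits = [1, 3, 5, 7, 0, 0, 0,
          4, 8, 9, 7, 7, 7, 7,
          9, 3, 3, 1, 6, 8, 0,
          8, 8, 8, 8, 8, 3, 1]
first_indices_of_runs(digits, 2)  # [4, 10, 15, 21]
first_indices_of_runs(digits, 3)  # [4, 10, 21]
```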

I thought to do it like this: first, create a column for each value of k that is equal to 1 if Digit is identical to the next k-1 digits

for i in np.arange(2, 5+1):
    df[f"k{i}"] = np.prod(
        [df.Digit == df.Digit.shift(-j) for j in np.arange(1, i)], 
        axis=0
    )
    Digit  k2  k3  k4  k5
0       1   0   0   0   0
1       3   0   0   0   0
2       5   0   0   0   0
3       7   0   0   0   0
4       0   1   1   0   0
5       0   1   0   0   0
6       0   0   0   0   0
7       4   0   0   0   0
8       8   0   0   0   0
9       9   0   0   0   0
10      7   1   1   1   0
11      7   1   1   0   0
12      7   1   0   0   0
13      7   0   0   0   0
14      9   0   0   0   0
15      3   1   0   0   0
16      3   0   0   0   0
17      1   0   0   0   0
18      6   0   0   0   0
19      8   0   0   0   0
20      0   0   0   0   0
21      8   1   1   1   1
22      8   1   1   1   0
23      8   1   1   0   0
24      8   1   0   0   0
25      8   0   0   0   0
26      3   0   0   0   0
27      1   0   0   0   0

Then, "remove" all consecutive ones in the k columns so that only the first element of each run is flagged

for i in np.arange(2, 5+1):
    df[f"k{i}"] = (df[f"k{i}"] == 1) & (df[f"k{i}"] != df[f"k{i}"].shift(1))
    Digit     k2     k3     k4     k5
0       1  False  False  False  False
1       3  False  False  False  False
2       5  False  False  False  False
3       7  False  False  False  False
4       0   True   True  False  False
5       0  False  False  False  False
6       0  False  False  False  False
7       4  False  False  False  False
8       8  False  False  False  False
9       9  False  False  False  False
10      7   True   True   True  False
11      7  False  False  False  False
12      7  False  False  False  False
13      7  False  False  False  False
14      9  False  False  False  False
15      3   True  False  False  False
16      3  False  False  False  False
17      1  False  False  False  False
18      6  False  False  False  False
19      8  False  False  False  False
20      0  False  False  False  False
21      8   True   True   True   True
22      8  False  False  False  False
23      8  False  False  False  False
24      8  False  False  False  False
25      8  False  False  False  False
26      3  False  False  False  False
27      1  False  False  False  False

This does exactly what I need, but it is inefficient with "big" DataFrames, and I'll need to apply it to DataFrames with billions of entries. For example

def index_of_identical_consecutive_digits(_df, colname, max_id=6):
    for i in np.arange(2, max_id+1):
        _df[f"k{i}"] = np.prod(
            [_df[colname] == _df[colname].shift(-j) for j in np.arange(1, i)], 
            axis=0
        )
    for i in np.arange(2, max_id+1):
        _df[f"k{i}"] = (_df[f"k{i}"] == 1) & (_df[f"k{i}"] != _df[f"k{i}"].shift(1))
    return _df

np.random.seed(0)
DF = pd.DataFrame({
    "Digit": np.random.randint(10, size=1_000_000_000)
})

DF = index_of_identical_consecutive_digits(DF, "Digit")

took about 10 minutes on a MacBookPro17,1. Any suggestions on how to improve it?

  • Do you want all values of k evaluated in a single call, or will you pass k to the function and get back a result for just that value? Your current function sweeps a range of values, but I'm curious whether that's what you ultimately want, as it affects the approach I have in mind.
  • @roganjosh I'd like to do it in a single call, from k=2 up to a maximum, usually not greater than 9, because I'll need all of them. However, if there's a more efficient way to do it with separate calls, that would be great too, as long as it yields all the results.
  • Instead of separate columns for each k value, how about a single column that contains the number of consecutive repeats? Then you just have to filter to repeats >= k.
  • FWIW I can't beat your timings. I tried multiple different approaches and then just defaulted back to extending the canonical approach for this, and you still win on that basis.
  • Can you time Polars? out = pl.from_pandas(DF).lazy().with_columns(pl.all_horizontal(pl.col("Digit") != pl.col("Digit").shift(), *(pl.col("Digit") == pl.col("Digit").shift(-n) for n in range(1, k))).fill_null(False).alias(f"k{k}") for k in range(2, 7)).collect(engine="streaming").to_pandas() (although using Polars directly would be "faster")

2 Answers


If you want to compute several such columns, an efficient option is to compute the number of consecutive values once, then broadcast the comparison to multiple ks.

You also want to avoid slow pandas operations (like groupby, which behaves much like a Python loop), especially if you have short stretches of numbers, and to avoid any explicit loop. One option is to use a boolean mask to get the indices of the run starts, np.diff to get the lengths of the runs, and then broadcast the comparison:

# identify starting points of consecutive values
m = df['Digit'].ne(df['Digit'].shift())
# make an array of indices
idx = np.arange(len(df))[m]
# compute the size of runs
size = np.diff(np.r_[idx, len(df)])

# broadcast the comparison to multiple ks
k = 5
ks = np.arange(2, k+1)
df[ks] = False
df.loc[m, ks] = size[:, None] >= ks

Output:

    Digit      2      3      4      5
0       1  False  False  False  False
1       3  False  False  False  False
2       5  False  False  False  False
3       7  False  False  False  False
4       0   True   True  False  False
5       0  False  False  False  False
6       0  False  False  False  False
7       4  False  False  False  False
8       8  False  False  False  False
9       9  False  False  False  False
10      7   True   True   True  False
11      7  False  False  False  False
12      7  False  False  False  False
13      7  False  False  False  False
14      9  False  False  False  False
15      3   True  False  False  False
16      3  False  False  False  False
17      1  False  False  False  False
18      6  False  False  False  False
19      8  False  False  False  False
20      0  False  False  False  False
21      8   True   True   True   True
22      8  False  False  False  False
23      8  False  False  False  False
24      8  False  False  False  False
25      8  False  False  False  False
26      3  False  False  False  False
27      1  False  False  False  False

Your 1B rows example runs in ~1 min on my machine (for comparison, creating the random DataFrame took 38s).
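If only the index lists are needed rather than boolean columns, the same idx/size intermediates give them directly. A minimal self-contained sketch of that idea (using np.flatnonzero in place of the arange indexing above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Digit": [
        1, 3, 5, 7, 0, 0, 0,
        4, 8, 9, 7, 7, 7, 7,
        9, 3, 3, 1, 6, 8, 0,
        8, 8, 8, 8, 8, 3, 1,
    ]
})

# starts of runs and their lengths, as above
m = df["Digit"].ne(df["Digit"].shift()).to_numpy()
idx = np.flatnonzero(m)
size = np.diff(np.r_[idx, len(df)])

# first index of every run of length >= k, for each k
runs = {k: idx[size >= k].tolist() for k in range(2, 6)}
# runs == {2: [4, 10, 15, 21], 3: [4, 10, 21], 4: [10, 21], 5: [21]}
```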

Here is a comparison over multiple lengths (k = 5):

(benchmark plot: runtime vs. DataFrame length for the different approaches)

Intermediate arrays:

 idx  size
   0     1
   1     1
   2     1
   3     1
   4     3
   7     1
   8     1
   9     1
  10     4
  14     1
  15     2
  17     1
  18     1
  19     1
  20     1
  21     5
  26     1
  27     1

And, if you want actual ki column names, change the last few lines to:

names = 'k' + pd.Index(ks).astype(str)
df[names] = False
df.loc[m, names] = size[:, None] >= ks

Output:

    Digit     k2     k3     k4     k5
0       1  False  False  False  False
1       3  False  False  False  False
2       5  False  False  False  False
3       7  False  False  False  False
4       0   True   True  False  False
5       0  False  False  False  False
...

optimization

If you have a lot of single (non-repeated) values, or want to set a minimal k, you can improve this approach further by not broadcasting the comparison for runs of length 1:

m = df['Digit'].ne(df['Digit'].shift())
idx = np.arange(len(df))[m]
size = np.diff(np.r_[idx, len(df)])

m2 = size > 1      # only keep runs
idx = idx[m2]      # whose length is
size = size[m2]    # greater than 1

ks = np.arange(2, k+1)

names = 'k' + pd.Index(ks).astype(str)
df[names] = False
df.loc[df.index[idx], names] = size[:, None] >= ks # this is also updated

Comparison (k = 5), the optimized approach is labeled "numpy2":

(benchmark plot: runtime vs. DataFrame length, including the optimized approach)

Optimized intermediate:

 idx  size
   4     3
  10     4
  15     2
  21     5



Avoid explicit iteration to speed up your code. I suggest breaking the problem into 1) identifying the indices where the digit switches, 2) using cumsum on those switches to label each run, and 3) aggregating over the runs with groupby.

Full Code (explanation below)

import pandas as pd
df = pd.DataFrame({
    "Digit": [
        1, 3, 5, 7, 0, 0, 0,
        4, 8, 9, 7, 7, 7, 7, 
        9, 3, 3, 1, 6, 8, 0,
        8, 8, 8, 8, 8, 3, 1,
    ]
})

df['Shifted'] = df['Digit'].shift()

df['Switched'] = df['Digit'] != df['Shifted']

df['CumulativeSum'] = df['Switched'].cumsum()

results = df.groupby('CumulativeSum')['Digit'].agg(['mean', 'idxmin', 'size'])

This allows a quick assessment of the number of switches and of how many repeats each run contains, without iteration, which is faster especially on larger datasets.

The table below shows the results for the provided test data, where CumulativeSum identifies each run, mean is the digit, idxmin is the index of the run's first row, and size is the count of the repeated value.

 CumulativeSum  mean  idxmin  size
             1     1       0     1
             2     3       1     1
             3     5       2     1
             4     7       3     1
             5     0       4     3
             6     4       7     1
             7     8       8     1
             8     9       9     1
             9     7      10     4
            10     9      14     1
            11     3      15     2
            12     1      17     1
            13     6      18     1
            14     8      19     1
            15     0      20     1
            16     8      21     5
            17     3      26     1
            18     1      27     1

Explanation

Start by creating additional columns:

  • shift: creates a new column with each value moved one index down: df['Shifted'] = df['Digit'].shift()

  • switch: creates a boolean column that marks each row as True if its value switched from the previous row, False otherwise: df['Switched'] = df['Digit'] != df['Shifted']

  • cumsum: applied to the Switched column, this creates a running count of the True values up to that point, effectively creating a group label for each run of identical Digits: df['CumulativeSum'] = df['Switched'].cumsum()

The result is a df like the one below (first ten rows shown):

   Digit  Shifted  Switched  CumulativeSum
0      1      NaN      True              1
1      3      1.0      True              2
2      5      3.0      True              3
3      7      5.0      True              4
4      0      7.0      True              5
5      0      0.0     False              5
6      0      0.0     False              5
7      4      0.0      True              6
8      8      4.0      True              7
9      9      8.0      True              8

Applying groupby aggregations to this returns the number of runs (the max of CumulativeSum), each run's value (mean), its starting index (idxmin), and its length (size).

results = df.groupby('CumulativeSum')['Digit'].agg(['mean', 'idxmin', 'size'])

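To recover the index lists the question asks for, the run table only needs a filter on size. A self-contained sketch (rebuilding results with the same shift/cumsum/groupby steps):

```python
import pandas as pd

df = pd.DataFrame({
    "Digit": [
        1, 3, 5, 7, 0, 0, 0,
        4, 8, 9, 7, 7, 7, 7,
        9, 3, 3, 1, 6, 8, 0,
        8, 8, 8, 8, 8, 3, 1,
    ]
})

# label each run of identical digits, then aggregate per run
run_id = (df["Digit"] != df["Digit"].shift()).cumsum()
results = df.groupby(run_id)["Digit"].agg(["mean", "idxmin", "size"])

# first index of every run of at least k identical digits
k = 3
indices = results.loc[results["size"] >= k, "idxmin"].tolist()
# indices == [4, 10, 21]
```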

3 Comments

And the reason you think AI garbage here is helpful is because...? I mean, what part of the OP's code suggests that you need to explain what .shift() does? Oh wait, you didn't. ChatGPT did.
For performance and scaling up, non-iterative solutions tend to work better in my experience, but perhaps I misunderstood the ask. Additional references may be helpful to individuals besides OP.
This doesn't seem like it generates the correct output, neither the table with K values in their own column nor lists of indexes for each K. I'm not sure why you're adding 3 unneeded additional columns for each step of the standard grouping process that seems like it would be really inefficient especially at 1 billion rows. Also taking the time to calculate the mean for groups for no reason is again an odd choice for this optimisation problem...
