I have a DataFrame with a column Digit of base-10 digits. For example:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Digit": [
        1, 3, 5, 7, 0, 0, 0,
        4, 8, 9, 7, 7, 7, 7,
        9, 3, 3, 1, 6, 8, 0,
        8, 8, 8, 8, 8, 3, 1,
    ]
})
Digit
0 1
1 3
2 5
3 7
4 0
5 0
6 0
7 4
8 8
9 9
10 7
11 7
12 7
13 7
14 9
15 3
16 3
17 1
18 6
19 8
20 0
21 8
22 8
23 8
24 8
25 8
26 3
27 1
I want to find the index of the first element of every run of at least k consecutive identical digits. For this df and k=2, I want to obtain the indices [4, 10, 15, 21]; for k=3, it should be [4, 10, 21]; and so on.
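To pin down the expected output, here is a minimal pure-Python reference. It is only a sketch of the definition: the helper name `first_indices_of_runs` is made up for illustration, and a Python-level loop like this would be far too slow for the real data.

```python
from itertools import groupby

def first_indices_of_runs(digits, k):
    """Return the start index of every run of at least k equal values."""
    starts = []
    pos = 0  # index of the first element of the current run
    for _, group in groupby(digits):
        length = sum(1 for _ in group)
        if length >= k:
            starts.append(pos)
        pos += length
    return starts

digits = [
    1, 3, 5, 7, 0, 0, 0,
    4, 8, 9, 7, 7, 7, 7,
    9, 3, 3, 1, 6, 8, 0,
    8, 8, 8, 8, 8, 3, 1,
]

print(first_indices_of_runs(digits, 2))  # [4, 10, 15, 21]
print(first_indices_of_runs(digits, 3))  # [4, 10, 21]
```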
My approach is as follows. First, for each value of k, create a column that equals 1 where Digit is identical to each of the next k-1 digits:
for i in np.arange(2, 5+1):
    df[f"k{i}"] = np.prod(
        [df.Digit == df.Digit.shift(-j) for j in np.arange(1, i)],
        axis=0
    )
Digit k2 k3 k4 k5
0 1 0 0 0 0
1 3 0 0 0 0
2 5 0 0 0 0
3 7 0 0 0 0
4 0 1 1 0 0
5 0 1 0 0 0
6 0 0 0 0 0
7 4 0 0 0 0
8 8 0 0 0 0
9 9 0 0 0 0
10 7 1 1 1 0
11 7 1 1 0 0
12 7 1 0 0 0
13 7 0 0 0 0
14 9 0 0 0 0
15 3 1 0 0 0
16 3 0 0 0 0
17 1 0 0 0 0
18 6 0 0 0 0
19 8 0 0 0 0
20 0 0 0 0 0
21 8 1 1 1 1
22 8 1 1 1 0
23 8 1 1 0 0
24 8 1 0 0 0
25 8 0 0 0 0
26 3 0 0 0 0
27 1 0 0 0 0
Then, "remove" the repeated ones in each k column, so that only the first element of each run stays marked:
for i in np.arange(2, 5+1):
    df[f"k{i}"] = (df[f"k{i}"] == 1) & (df[f"k{i}"] != df[f"k{i}"].shift(1))
Digit k2 k3 k4 k5
0 1 False False False False
1 3 False False False False
2 5 False False False False
3 7 False False False False
4 0 True True False False
5 0 False False False False
6 0 False False False False
7 4 False False False False
8 8 False False False False
9 9 False False False False
10 7 True True True False
11 7 False False False False
12 7 False False False False
13 7 False False False False
14 9 False False False False
15 3 True False False False
16 3 False False False False
17 1 False False False False
18 6 False False False False
19 8 False False False False
20 0 False False False False
21 8 True True True True
22 8 False False False False
23 8 False False False False
24 8 False False False False
25 8 False False False False
26 3 False False False False
27 1 False False False False
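For completeness, the indices themselves can be read off the boolean columns. The sketch below repeats the two steps on the small example in a compact form and then collects the positions; the names `same` and `starts` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Digit": [
        1, 3, 5, 7, 0, 0, 0,
        4, 8, 9, 7, 7, 7, 7,
        9, 3, 3, 1, 6, 8, 0,
        8, 8, 8, 8, 8, 3, 1,
    ]
})

for k in range(2, 5 + 1):
    # Step 1: 1 where the digit equals each of the next k-1 digits
    same = pd.Series(
        np.prod([df.Digit == df.Digit.shift(-j) for j in range(1, k)], axis=0),
        index=df.index,
    )
    # Step 2: keep only the first 1 of each block of consecutive ones
    df[f"k{k}"] = (same == 1) & (same != same.shift(1))

# Collect the index labels where each boolean column is True
starts = {
    k: df.index[df[f"k{k}"].to_numpy()].tolist()
    for k in range(2, 5 + 1)
}
print(starts[2])  # [4, 10, 15, 21]
print(starts[3])  # [4, 10, 21]
```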
This does exactly what I need, but it is too slow on large DataFrames, and I need to apply it to DataFrames with billions of entries. For example:
def index_of_identical_consecutive_digits(_df, colname, max_id=6):
    for i in np.arange(2, max_id+1):
        _df[f"k{i}"] = np.prod(
            [_df[colname] == _df[colname].shift(-j) for j in np.arange(1, i)],
            axis=0
        )
    for i in np.arange(2, max_id+1):
        _df[f"k{i}"] = (_df[f"k{i}"] == 1) & (_df[f"k{i}"] != _df[f"k{i}"].shift(1))
    return _df
np.random.seed(0)
DF = pd.DataFrame({
"Digit": np.random.randint(10, size=1_000_000_000)
})
DF = index_of_identical_consecutive_digits(DF, "Digit")
This took 10 minutes on a MacBookPro17,1. Any suggestions on how to improve it?
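One direction that might be worth exploring is to compute the run boundaries once and then treat every k as a simple filter on the run lengths. The following is only a sketch under assumptions: it has not been benchmarked at the 1e9 scale, the function name `run_starts_by_k` and its parameters `k_min` and `k_max` are hypothetical, and it returns arrays of start positions rather than the boolean k columns built above.

```python
import numpy as np

def run_starts_by_k(values, k_min=2, k_max=9):
    """Map each k to the start positions of runs with length >= k."""
    a = np.asarray(values)
    # Positions where the value differs from its predecessor start a new run
    change = np.flatnonzero(a[1:] != a[:-1]) + 1
    starts = np.concatenate(([0], change))
    # Run length = distance to the next run start (or to the end of the array)
    lengths = np.diff(np.concatenate((starts, [a.size])))
    return {k: starts[lengths >= k] for k in range(k_min, k_max + 1)}

digits = [
    1, 3, 5, 7, 0, 0, 0,
    4, 8, 9, 7, 7, 7, 7,
    9, 3, 3, 1, 6, 8, 0,
    8, 8, 8, 8, 8, 3, 1,
]

result = run_starts_by_k(digits, k_min=2, k_max=5)
print(result[2].tolist())  # [4, 10, 15, 21]
print(result[3].tolist())  # [4, 10, 21]
```

With a DataFrame, the input would be something like `DF["Digit"].to_numpy()`. The comparison and the run lengths are computed a single time, so adding more values of k only adds one cheap filter each.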


k=2 to a maximum, usually not greater than 9, because I'll need all of them. However, if there's a more efficient way to do it with separate calls, it would be great, as long as it yields all the results.

repeats>=k.

out = pl.from_pandas(DF).lazy().with_columns(pl.all_horizontal(pl.col("Digit") != pl.col("Digit").shift(), *(pl.col("Digit") == pl.col("Digit").shift(-n) for n in range(1, k))).fill_null(False).alias(f"k{k}") for k in range(2, 7)).collect(engine="streaming").to_pandas() (although using Polars directly would be "faster")