
I have a DataFrame with a column Digit of base-10 digits. For example:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Digit": [
        1, 3, 5, 7, 0, 0, 0,
        4, 8, 9, 7, 7, 7, 7, 
        9, 3, 3, 1, 6, 8, 0,
        8, 8, 8, 8, 8, 3, 1,
    ]
})
    Digit
0       1
1       3
2       5
3       7
4       0
5       0
6       0
7       4
8       8
9       9
10      7
11      7
12      7
13      7
14      9
15      3
16      3
17      1
18      6
19      8
20      0
21      8
22      8
23      8
24      8
25      8
26      3
27      1

I want to identify the index of the first element of every run of at least k consecutive identical digits. So, for this df and k=2, I want to obtain the indices [4, 10, 15, 21]; for k=3, it should be [4, 10, 21]; and so on.
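For concreteness, here is a naive pure-Python baseline (a correctness reference only, far too slow for my real data):

```python
def first_indices_of_runs(digits, k):
    """Index of the first element of each run of >= k identical digits (naive scan)."""
    out, i, n = [], 0, len(digits)
    while i < n:
        j = i
        # advance j to the end of the current run
        while j < n and digits[j] == digits[i]:
            j += 1
        if j - i >= k:
            out.append(i)
        i = j
    return out

digits = [1, 3, 5, 7, 0, 0, 0,
          4, 8, 9, 7, 7, 7, 7,
          9, 3, 3, 1, 6, 8, 0,
          8, 8, 8, 8, 8, 3, 1]
first_indices_of_runs(digits, 2)  # [4, 10, 15, 21]
first_indices_of_runs(digits, 3)  # [4, 10, 21]
```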

I thought to do it like this: first, create a column for each value of k that is equal to 1 if Digit is identical to the next k-1 digits

for i in np.arange(2, 5+1):
    df[f"k{i}"] = np.prod(
        [df.Digit == df.Digit.shift(-j) for j in np.arange(1, i)], 
        axis=0
    )
    Digit  k2  k3  k4  k5
0       1   0   0   0   0
1       3   0   0   0   0
2       5   0   0   0   0
3       7   0   0   0   0
4       0   1   1   0   0
5       0   1   0   0   0
6       0   0   0   0   0
7       4   0   0   0   0
8       8   0   0   0   0
9       9   0   0   0   0
10      7   1   1   1   0
11      7   1   1   0   0
12      7   1   0   0   0
13      7   0   0   0   0
14      9   0   0   0   0
15      3   1   0   0   0
16      3   0   0   0   0
17      1   0   0   0   0
18      6   0   0   0   0
19      8   0   0   0   0
20      0   0   0   0   0
21      8   1   1   1   1
22      8   1   1   1   0
23      8   1   1   0   0
24      8   1   0   0   0
25      8   0   0   0   0
26      3   0   0   0   0
27      1   0   0   0   0

Then, "remove" all consecutive ones in the k columns so that only the first element of each run is flagged

for i in np.arange(2, 5+1):
    df[f"k{i}"] = (df[f"k{i}"] == 1) & (df[f"k{i}"] != df[f"k{i}"].shift(1))
    Digit     k2     k3     k4     k5
0       1  False  False  False  False
1       3  False  False  False  False
2       5  False  False  False  False
3       7  False  False  False  False
4       0   True   True  False  False
5       0  False  False  False  False
6       0  False  False  False  False
7       4  False  False  False  False
8       8  False  False  False  False
9       9  False  False  False  False
10      7   True   True   True  False
11      7  False  False  False  False
12      7  False  False  False  False
13      7  False  False  False  False
14      9  False  False  False  False
15      3   True  False  False  False
16      3  False  False  False  False
17      1  False  False  False  False
18      6  False  False  False  False
19      8  False  False  False  False
20      0  False  False  False  False
21      8   True   True   True   True
22      8  False  False  False  False
23      8  False  False  False  False
24      8  False  False  False  False
25      8  False  False  False  False
26      3  False  False  False  False
27      1  False  False  False  False

This does exactly what I need, but it is inefficient with "big" DataFrames, and I'll need to apply it to DataFrames with billions of entries. For example

def index_of_identical_consecutive_digits(_df, colname, max_id=6):
    for i in np.arange(2, max_id+1):
        _df[f"k{i}"] = np.prod(
            [_df[colname] == _df[colname].shift(-j) for j in np.arange(1, i)], 
            axis=0
        )
    for i in np.arange(2, max_id+1):
        _df[f"k{i}"] = (_df[f"k{i}"] == 1) & (_df[f"k{i}"] != _df[f"k{i}"].shift(1))
    return _df

np.random.seed(0)
DF = pd.DataFrame({
    "Digit": np.random.randint(10, size=1_000_000_000)
})

DF = index_of_identical_consecutive_digits(DF, "Digit")

took about 10 minutes on a MacBookPro17,1. Any suggestions on how to improve it?

  • Do you want all values of k evaluated in a single call, or will you pass k to the function and get back a result for just that value? Your current function sweeps a range of values, but I'm curious whether that's what you ultimately want, as it affects the approach I have in mind.
  • @roganjosh I'd like to do it in a single call, from k=2 up to a maximum, usually not greater than 9, because I'll need all of them. However, if there's a more efficient way to do it with separate calls, that would be great too, as long as it yields all the results.
  • Instead of separate columns for each k value, how about a single column that contains the number of consecutive repeats? Then you just have to filter to repeats >= k.
  • FWIW I can't beat your timings. I tried multiple different approaches and then just defaulted back to extending the canonical approach for this, and you still win on that basis.
  • Can you time Polars? out = pl.from_pandas(DF).lazy().with_columns(pl.all_horizontal(pl.col("Digit") != pl.col("Digit").shift(), *(pl.col("Digit") == pl.col("Digit").shift(-n) for n in range(1, k))).fill_null(False).alias(f"k{k}") for k in range(2, 7)).collect(engine="streaming").to_pandas() (although using Polars directly would be "faster")

2 Answers


If you want to compute several such columns, an efficient option is to compute the number of consecutive values once, then broadcast the comparison to multiple ks.

You also want to avoid slow pandas operations (like groupby, which behaves much like a Python loop), especially if you have short stretches of numbers, and to avoid any explicit loop. One option is to use a boolean mask to get the indices of the run starts, np.diff to get the lengths of the runs, and then broadcast the comparison:

# identify starting points of consecutive values
m = df['Digit'].ne(df['Digit'].shift())
# make an array of indices
idx = np.arange(len(df))[m]
# compute the size of runs
size = np.diff(np.r_[idx, len(df)])

# broadcast the comparison to multiple ks
k = 5
ks = np.arange(2, k+1)
df[ks] = False
df.loc[m, ks] = size[:, None] >= ks

Output:

    Digit      2      3      4      5
0       1  False  False  False  False
1       3  False  False  False  False
2       5  False  False  False  False
3       7  False  False  False  False
4       0   True   True  False  False
5       0  False  False  False  False
6       0  False  False  False  False
7       4  False  False  False  False
8       8  False  False  False  False
9       9  False  False  False  False
10      7   True   True   True  False
11      7  False  False  False  False
12      7  False  False  False  False
13      7  False  False  False  False
14      9  False  False  False  False
15      3   True  False  False  False
16      3  False  False  False  False
17      1  False  False  False  False
18      6  False  False  False  False
19      8  False  False  False  False
20      0  False  False  False  False
21      8   True   True   True   True
22      8  False  False  False  False
23      8  False  False  False  False
24      8  False  False  False  False
25      8  False  False  False  False
26      3  False  False  False  False
27      1  False  False  False  False

Your 1B rows example runs in ~1 min on my machine (for comparison, creating the random DataFrame took 38s).
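If only the index lists are needed rather than boolean columns, the same idx/size intermediates give them directly. A minimal self-contained sketch of that idea (using np.flatnonzero in place of the arange indexing above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Digit": [
        1, 3, 5, 7, 0, 0, 0,
        4, 8, 9, 7, 7, 7, 7,
        9, 3, 3, 1, 6, 8, 0,
        8, 8, 8, 8, 8, 3, 1,
    ]
})

# starts of runs and their lengths, as above
m = df["Digit"].ne(df["Digit"].shift()).to_numpy()
idx = np.flatnonzero(m)
size = np.diff(np.r_[idx, len(df)])

# first index of every run of length >= k, for each k
runs = {k: idx[size >= k].tolist() for k in range(2, 6)}
# runs == {2: [4, 10, 15, 21], 3: [4, 10, 21], 4: [10, 21], 5: [21]}
```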

Here is a comparison over multiple lengths (k = 5):

(benchmark plot: runtime vs. DataFrame length for the different approaches)

Intermediate arrays:

 idx  size
   0     1
   1     1
   2     1
   3     1
   4     3
   7     1
   8     1
   9     1
  10     4
  14     1
  15     2
  17     1
  18     1
  19     1
  20     1
  21     5
  26     1
  27     1

And, if you want actual ki column names, change the last few lines to:

names = 'k' + pd.Index(ks).astype(str)
df[names] = False
df.loc[m, names] = size[:, None] >= ks

Output:

    Digit     k2     k3     k4     k5
0       1  False  False  False  False
1       3  False  False  False  False
2       5  False  False  False  False
3       7  False  False  False  False
4       0   True   True  False  False
5       0  False  False  False  False
...

optimization

If you have a lot of single (non-repeated) values, or want to set a minimal k, you can improve this approach further by not broadcasting the comparison for runs of length 1:

m = df['Digit'].ne(df['Digit'].shift())
idx = np.arange(len(df))[m]
size = np.diff(np.r_[idx, len(df)])

m2 = size > 1      # only keep runs
idx = idx[m2]      # whose length is
size = size[m2]    # greater than 1

ks = np.arange(2, k+1)

names = 'k' + pd.Index(ks).astype(str)
df[names] = False
df.loc[df.index[idx], names] = size[:, None] >= ks # this is also updated

Comparison (k = 5), the optimized approach is labeled "numpy2":

(benchmark plot: runtime vs. DataFrame length, including the optimized approach)

Optimized intermediate:

 idx  size
   4     3
  10     4
  15     2
  21     5



Avoid explicit iteration to speed up your code. I suggest breaking the problem into 1) identifying the indices where the digit switches, 2) using cumsum on those switches to label each run, and 3) aggregating over the runs with groupby.

Full Code (explanation below)

import pandas as pd
df = pd.DataFrame({
    "Digit": [
        1, 3, 5, 7, 0, 0, 0,
        4, 8, 9, 7, 7, 7, 7, 
        9, 3, 3, 1, 6, 8, 0,
        8, 8, 8, 8, 8, 3, 1,
    ]
})

df['Shifted'] = df['Digit'].shift()

df['Switched'] = df['Digit'] != df['Shifted']

df['CumulativeSum'] = df['Switched'].cumsum()

results = df.groupby('CumulativeSum')['Digit'].agg(['mean', 'idxmin', 'size'])

This allows a quick assessment of the number of switches and of how many repeats each run contains, without iteration, which is faster especially on larger datasets.

The table below shows the results for the provided test data, where CumulativeSum identifies each run, mean is the digit, idxmin is the index of the run's first row, and size is the count of the repeated value.

 CumulativeSum  mean  idxmin  size
             1     1       0     1
             2     3       1     1
             3     5       2     1
             4     7       3     1
             5     0       4     3
             6     4       7     1
             7     8       8     1
             8     9       9     1
             9     7      10     4
            10     9      14     1
            11     3      15     2
            12     1      17     1
            13     6      18     1
            14     8      19     1
            15     0      20     1
            16     8      21     5
            17     3      26     1
            18     1      27     1

Explanation

Start by creating additional columns:

  • shift: creates a new column with each value moved one index down: df['Shifted'] = df['Digit'].shift()

  • switch: creates a boolean column that marks each row as True if its value switched from the previous row, False otherwise: df['Switched'] = df['Digit'] != df['Shifted']

  • cumsum: applied to the Switched column, this creates a running count of the True values up to that point, effectively creating a group label for each run of identical Digits: df['CumulativeSum'] = df['Switched'].cumsum()

The result is a df like the one below (first ten rows shown):

   Digit  Shifted  Switched  CumulativeSum
0      1      NaN      True              1
1      3      1.0      True              2
2      5      3.0      True              3
3      7      5.0      True              4
4      0      7.0      True              5
5      0      0.0     False              5
6      0      0.0     False              5
7      4      0.0      True              6
8      8      4.0      True              7
9      9      8.0      True              8

Applying groupby aggregations to this returns the number of runs (the max of CumulativeSum), each run's value (mean), its starting index (idxmin), and its length (size).

results = df.groupby('CumulativeSum')['Digit'].agg(['mean', 'idxmin', 'size'])

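To recover the index lists the question asks for, the run table only needs a filter on size. A self-contained sketch (rebuilding results with the same shift/cumsum/groupby steps):

```python
import pandas as pd

df = pd.DataFrame({
    "Digit": [
        1, 3, 5, 7, 0, 0, 0,
        4, 8, 9, 7, 7, 7, 7,
        9, 3, 3, 1, 6, 8, 0,
        8, 8, 8, 8, 8, 3, 1,
    ]
})

# label each run of identical digits, then aggregate per run
run_id = (df["Digit"] != df["Digit"].shift()).cumsum()
results = df.groupby(run_id)["Digit"].agg(["mean", "idxmin", "size"])

# first index of every run of at least k identical digits
k = 3
indices = results.loc[results["size"] >= k, "idxmin"].tolist()
# indices == [4, 10, 21]
```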

3 Comments

And the reason you think AI garbage here is helpful is because...? I mean, what part of the OP's code suggests that you need to explain what .shift() does? Oh wait, you didn't. ChatGPT did.
For performance and scaling up, non-iterative solutions tend to work better in my experience, but perhaps I misunderstood the ask. Additional references may be helpful to individuals besides OP.
This doesn't seem like it generates the correct output, neither the table with K values in their own column nor lists of indexes for each K. I'm not sure why you're adding 3 unneeded additional columns for each step of the standard grouping process that seems like it would be really inefficient especially at 1 billion rows. Also taking the time to calculate the mean for groups for no reason is again an odd choice for this optimisation problem...
