2

I'm looking for a faster method of applying values to a column in a DataFrame. The value is based on two True and False values in the first and second column. This is my current solution:

df['result'] = df.check1.astype(int)

for i in range(len(df)):
    if df.result[i] != 1:
        df.result[i] = df.result.shift(1)[i] + df.check2[i].astype(int)

Which yields this result:

    check1  check2  result
0   True    False   1
1   False   False   1
2   False   False   1
3   False   False   1
4   False   False   1
5   False   False   1
6   False   True    2
7   False   False   2
8   False   True    3
9   False   False   3
10  False   True    4
11  False   False   4
12  False   True    5
13  False   False   5
14  False   True    6
15  False   False   6
16  False   True    7
17  False   False   7
18  False   False   7
19  False   False   7
20  False   True    8
21  False   False   8
22  False   True    9
23  True    False   1
24  False   False   1

So the third column needs to be a number based on the value in the row above it. If check1 is True the number needs to go back to 1. If check2 is true, 1 needs to be added to the number. Otherwise the number stays the same.

The current code is fine but it's taking too long as I need to apply this to a DataFrame with approx. 70.000 rows. I'm pretty sure it can be improved (I'm guessing using the apply function but I'm not sure).
Any ideas?

2 Answers 2

3

Use pandas.DataFrame.groupby.cumsum:

import pandas as pd

df['result'] = df.groupby(df['check1'].cumsum())[['check1', 'check2']].cumsum().sum(1)

Or @Dan's suggestion:

df['result'] = df.groupby(df['check1'].cumsum())['check2'].cumsum().add(1)

Output:

    check1  check2  result
0     True   False     1.0
1    False   False     1.0
2    False   False     1.0
3    False   False     1.0
4    False   False     1.0
5    False   False     1.0
6    False    True     2.0
7    False   False     2.0
8    False    True     3.0
9    False   False     3.0
10   False    True     4.0
11   False   False     4.0
12   False    True     5.0
13   False   False     5.0
14   False    True     6.0
15   False   False     6.0
16   False    True     7.0
17   False   False     7.0
18   False   False     7.0
19   False   False     7.0
20   False    True     8.0
21   False   False     8.0
22   False    True     9.0
23    True   False     1.0
24   False   False     1.0
Sign up to request clarification or add additional context in comments.

1 Comment

@Dan Ah, sounds even better :) Updated the post
0

You want to iterate a dataframe using the value of the preceding row. In that case, the most efficient way is to directly iterate the underlying numpy arrays:

df = pd.read_fwf(io.StringIO(t))

df['result'] = df.check1.astype(int)

res = df['result'].values
c1 = df['check1'].values
c2 = df['check2'].values
old = -1
for i in range(len(df)):
    if res[i] != 1:
        res[i] = old + int(c2[i])
    old = res[i]

This works fine because numpy arrays are mutable types, so the changes are reflected in the dataframe.

Timeit says that this is twice as fast as the original solution from @Chris's, and still 1.5 times faster after @Dan's improvement.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.