Applying values to a DataFrame without using a for-loop

Question

I'm looking for a faster way to fill a column in a DataFrame. Each value depends on two boolean columns, check1 and check2, and on the result in the previous row. This is my current solution:

df['result'] = df.check1.astype(int)

for i in range(len(df)):
    if df.result[i] != 1:
        df.result[i] = df.result.shift(1)[i] + df.check2[i].astype(int)

Which yields this result:

    check1  check2  result
0   True    False   1
1   False   False   1
2   False   False   1
3   False   False   1
4   False   False   1
5   False   False   1
6   False   True    2
7   False   False   2
8   False   True    3
9   False   False   3
10  False   True    4
11  False   False   4
12  False   True    5
13  False   False   5
14  False   True    6
15  False   False   6
16  False   True    7
17  False   False   7
18  False   False   7
19  False   False   7
20  False   True    8
21  False   False   8
22  False   True    9
23  True    False   1
24  False   False   1

So the third column needs to be a number based on the value in the row above it. If check1 is True the number needs to go back to 1. If check2 is true, 1 needs to be added to the number. Otherwise the number stays the same.
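For reference, the rule described above can be written as a small plain-Python function and checked against the table. This is only a sketch: the name `running_counter` and the starting value of 0 before the first row are assumptions made for illustration, and the sample frame is rebuilt by hand from the 25 rows shown.

```python
import pandas as pd


def running_counter(check1, check2):
    """Reset to 1 when check1 is True, add 1 when check2 is True, else carry over."""
    out = []
    current = 0  # assumption: value used before the first check1 == True row
    for c1, c2 in zip(check1, check2):
        if c1:
            current = 1
        elif c2:
            current += 1
        out.append(current)
    return out


# Rebuild the 25-row example from the question.
check1_true = {0, 23}
check2_true = {6, 8, 10, 12, 14, 16, 20, 22}
df = pd.DataFrame({
    "check1": [i in check1_true for i in range(25)],
    "check2": [i in check2_true for i in range(25)],
})
df["result"] = running_counter(df["check1"], df["check2"])
print(df)
```

This is still a Python-level loop, so it mainly serves as a readable specification to compare faster versions against.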

The current code works, but it is too slow: I need to apply it to a DataFrame with approximately 70,000 rows. I'm fairly sure it can be improved (perhaps with the apply function, but I'm not sure).
Any ideas?

Chris · Accepted Answer · 2019-05-23 13:05:03Z

3

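Use pandas.DataFrame.groupby with cumsum. Grouping on the running total of check1 starts a new group at every True in check1, and the cumulative sum inside each group gives the counter: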

import pandas as pd

df['result'] = df.groupby(df['check1'].cumsum())[['check1', 'check2']].cumsum().sum(1)

Or @Dan's suggestion:

df['result'] = df.groupby(df['check1'].cumsum())['check2'].cumsum().add(1)

Output:

    check1  check2  result
0     True   False     1.0
1    False   False     1.0
2    False   False     1.0
3    False   False     1.0
4    False   False     1.0
5    False   False     1.0
6    False    True     2.0
7    False   False     2.0
8    False    True     3.0
9    False   False     3.0
10   False    True     4.0
11   False   False     4.0
12   False    True     5.0
13   False   False     5.0
14   False    True     6.0
15   False   False     6.0
16   False    True     7.0
17   False   False     7.0
18   False   False     7.0
19   False   False     7.0
20   False    True     8.0
21   False   False     8.0
22   False    True     9.0
23    True   False     1.0
24   False   False     1.0
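To see why the grouping works, it can help to split the one-liner into steps on a tiny frame. The sketch below uses made-up data (not the question's table) and casts check2 to int before the cumulative sum, which is one way to keep an integer result; that cast is an addition here, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "check1": [True, False, False, True, False, False],
    "check2": [False, True, False, False, True, True],
})

# Every True in check1 starts a new block; the running total is the block id.
block = df["check1"].cumsum()

# Inside each block, count the check2 hits seen so far, then offset by 1
# so that the row where check1 is True gets the value 1.
hits = df["check2"].astype(int).groupby(block).cumsum()
df["result"] = hits.add(1)

print(df.assign(block=block))
```

Two edge cases are worth checking on real data: rows before the first True in check1 fall into block 0, and a row where check1 and check2 are both True yields 2 here, whereas the loop in the question would give 1.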

edited May 23, 2019 at 13:05

answered May 23, 2019 at 12:57

Chris


1 Comment

Chris Over a year ago

@Dan Ah, sounds even better :) Updated the post

Serge Ballesta · Accepted Answer · 2019-05-23 13:08:51Z

0

You want to iterate over a DataFrame using the value from the preceding row. In that case, an efficient approach is to iterate directly over the underlying NumPy arrays:

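import io
import pandas as pd

# t is assumed to be a string holding the question's table as fixed-width text
df = pd.read_fwf(io.StringIO(t))

c1 = df['check1'].values
c2 = df['check2'].values
res = c1.astype(int)  # fresh writable array: 1 where check1 is True, else 0
old = -1
for i in range(len(df)):
    if res[i] != 1:
        # not a reset row: previous value, plus 1 if check2 is True
        res[i] = old + int(c2[i])
    old = res[i]
df['result'] = res

The loop only touches plain NumPy arrays, which avoids the per-element pandas indexing overhead. NumPy arrays are mutable, so writing into df['result'].values directly can also update the DataFrame in place, but that relies on .values returning a writable view, which is not guaranteed (for example with copy-on-write enabled). Assigning the finished array back once at the end avoids that dependency.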

According to timeit, this is twice as fast as @Chris's original solution, and still 1.5 times faster after @Dan's improvement.
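The timing claim above can be checked locally with something like the following sketch. The synthetic data, the 70,000-row size taken from the question, and the function names `with_groupby` and `with_numpy_loop` are assumptions for illustration. The data is built so that the first row has check1 True and check1 and check2 are never True together, since the two approaches differ in those edge cases.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 70_000
check1 = rng.random(n) < 0.01
check1[0] = True                           # make sure the first row starts a block
check2 = (rng.random(n) < 0.3) & ~check1   # never True on the same row as check1
df = pd.DataFrame({"check1": check1, "check2": check2})


def with_groupby(frame):
    """Vectorised version: cumulative sum of check2 within each check1 block."""
    block = frame["check1"].cumsum()
    return frame["check2"].astype(int).groupby(block).cumsum().add(1).to_numpy()


def with_numpy_loop(frame):
    """Python loop over NumPy arrays, carrying the previous value along."""
    c1 = frame["check1"].to_numpy()
    c2 = frame["check2"].to_numpy()
    res = c1.astype(int)  # new writable array
    old = -1
    for i in range(len(res)):
        if res[i] != 1:
            res[i] = old + int(c2[i])
        old = res[i]
    return res


for func in (with_groupby, with_numpy_loop):
    seconds = timeit.timeit(lambda: func(df), number=3) / 3
    print(f"{func.__name__}: {seconds:.4f} s per call")
```

The numbers depend on the machine, the pandas and NumPy versions, and how often check1 is True, so they may not reproduce the ratios quoted above.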

answered May 23, 2019 at 13:08

Serge Ballesta

150k13 gold badges137 silver badges267 bronze badges
