Speeding up loop when normalizing Pandas data

Question

I have a pandas dataframe:

|  col1  | heading |
|--------|---------|
|heading1|   true  |
|abc     |  false  |
|efg     |  false  |
|hij     |  false  |
|heading2|   true  |
|klm     |  false  |
|...     |  false  |

This data is actually "sequential" and I would like to transform it to this structure:

|  col1  |  Parent   |
|--------|-----------|
|heading1|  heading1 |
|abc     |  heading1 |
|efg     |  heading1 |
|hij     |  heading1 |
|heading2|  heading2 |
|klm     |  heading2 |
|...     |  headingN |

I have more than 10M rows, so this method takes too long:

df['Parent'] = df['col1']

for index, row in df.iterrows():
    if row['heading']:
        current = row['col1']
    else:
        df.loc[index, 'Parent'] = current

Do you have any advice on a faster process?

How about using a fill function?

– Anton vBR, Commented Oct 20, 2018 at 16:22

user3483203 · Accepted Answer · 2018-10-20 16:24:53Z

5

You can use a mask with ffill:

df.assign(heading=df.col1.mask(~df.col1.str.startswith('heading')).ffill())

       col1   heading
0  heading1  heading1
1       abc  heading1
2       efg  heading1
3       hij  heading1
4  heading2  heading2
5       klm  heading2

This works by replacing every value that does not start with 'heading' with NaN, and then forward-filling the last non-NaN value:

df.col1.mask(~df.col1.str.startswith('heading'))

0    heading1
1         NaN
2         NaN
3         NaN
4    heading2
5         NaN
Name: col1, dtype: object

df.col1.mask(~df.col1.str.startswith('heading')).ffill()

0    heading1
1    heading1
2    heading1
3    heading1
4    heading2
5    heading2
Name: col1, dtype: object
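
For anyone who wants to run the steps above end to end, here is a minimal self-contained sketch. It assumes the sample data from the question, and it writes the result to a new Parent column (my choice, to match the question's target layout) instead of overwriting heading. It only works if heading rows really can be recognised by their 'heading' prefix.

```python
import pandas as pd

# Sample data mirroring the question (assumed; the real frame has 10M+ rows).
df = pd.DataFrame({
    "col1": ["heading1", "abc", "efg", "hij", "heading2", "klm"],
    "heading": [True, False, False, False, True, False],
})

# True for rows whose text starts with the prefix "heading".
is_heading = df["col1"].str.startswith("heading")

# Blank out every non-heading row, then carry the last heading forward.
parent = df["col1"].mask(~is_heading).ffill()

# "Parent" is an illustrative column name taken from the question's target table.
result = df.assign(Parent=parent)
print(result)
```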

answered Oct 20, 2018 at 16:24

1 Comment

Anton vBR Over a year ago

Yes, I also had ffill in mind. But this doesn't even use the heading column. Smart.

Andras Deak -- Слава Україні · Accepted Answer · 2018-10-20 16:17:57Z

Probably not a very pandas-idiomatic solution, but you can cumsum the logical column and use that to grab the corresponding heading for each row. In essence we're defining a piecewise-constant index array that is only incremented at each True value in the original heading column.

import pandas as pd

# set up some dummy data
df = pd.DataFrame({'heading': [True, False, False, False, True, False, False]},
                  index=['heading1', 'foo', 'bar', 'baz', 'heading2', 'quux', 'quuz'])

# get every 'heading' index
headings = df.index[df.heading]
# fetch which row corresponds to which 'heading'
indices = df.heading.cumsum() - 1
# fetch the actual headings for each row
df['parent'] = headings[indices]

print(df)

The output of the above code is

          heading    parent
heading1     True  heading1
foo         False  heading1
bar         False  heading1
baz         False  heading1
heading2     True  heading2
quux        False  heading2
quuz        False  heading2

You can then drop the now-unnecessary heading column. Of course, you can also compute the logical array directly from the index and work with that:

headline = df.index.str.startswith('heading')  # boolean mask over the index
headings = df.index[headline]
indices = headline.cumsum() - 1  # use the derived mask, not df.heading
df['parent'] = headings[indices]
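
The same cumsum idea can be applied to the layout from the question, where the labels live in col1 and not in the index. This is only a sketch on the question's sample data. It assumes the first row is a heading, because otherwise the group number of the leading rows is -1 and the lookup wraps around to the last heading.

```python
import numpy as np
import pandas as pd

# Sample data in the question's layout (assumed).
df = pd.DataFrame({
    "col1": ["heading1", "abc", "efg", "hij", "heading2", "klm"],
    "heading": [True, False, False, False, True, False],
})

is_heading = df["heading"].to_numpy()

# Text of each heading row, in order of appearance.
headings = df["col1"].to_numpy()[is_heading]

# Running count of headings seen so far, shifted to a 0-based group number.
group = np.cumsum(is_heading) - 1

# Look up the heading text for every row's group.
df["Parent"] = headings[group]
print(df)
```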

Anton vBR · Accepted Answer · 2018-10-20 16:36:45Z

2

I thought of ffill too. Using df.pop() also removes the heading column from the frame.

df['Parent'] = df['col1'].mul(df.pop('heading')).replace('',np.nan).ffill()

Full example

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'col1': ['heading1', 'abc', 'efg', 'hij', 'heading2', 'klm'],
    'heading': [True, False, False, False, True, False]
})

df['Parent'] = df['col1'].mul(df.pop('heading')).replace('',np.nan).ffill()
print(df)

Returns:

       col1    Parent
0  heading1  heading1
1       abc  heading1
2       efg  heading1
3       hij  heading1
4  heading2  heading2
5       klm  heading2

edited Oct 20, 2018 at 16:36

answered Oct 20, 2018 at 16:31

jpp · Accepted Answer · 2018-10-20 16:45:02Z

1

`where` + `pop` + `ffill`

You may find this more efficient. Data from @AntonvBR.

df['Parent'] = df['col1'].where(df.pop('heading')).ffill()

print(df)

       col1    Parent
0  heading1  heading1
1       abc  heading1
2       efg  heading1
3       hij  heading1
4  heading2  heading2
5       klm  heading2
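
To check the efficiency claim on your own machine, something like the following sketch may help. The synthetic data, the size n and the helper names parent_loop and parent_vectorized are all made up for illustration. The loop follows the spirit of the one in the question, and the timings will vary, so none are quoted here.

```python
import time

import numpy as np
import pandas as pd

# Synthetic data (assumed): roughly 10% of the rows are headings.
rng = np.random.default_rng(0)
n = 2_000  # kept small so the slow loop finishes quickly
heading = rng.random(n) < 0.1
heading[0] = True  # make sure the first row is a heading
df = pd.DataFrame({"col1": [f"item{i}" for i in range(n)], "heading": heading})


def parent_loop(frame):
    """Row-by-row version, in the spirit of the loop from the question."""
    out = frame["col1"].copy()
    current = None
    for index, row in frame.iterrows():
        if row["heading"]:
            current = row["col1"]
        else:
            out.loc[index] = current
    return out


def parent_vectorized(frame):
    """where + ffill, without pop so the input frame is left untouched."""
    return frame["col1"].where(frame["heading"]).ffill()


start = time.perf_counter()
loop_result = parent_loop(df)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
vec_result = parent_vectorized(df)
vec_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, where + ffill: {vec_seconds:.4f}s")
print("same result:", loop_result.tolist() == vec_result.tolist())
```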

answered Oct 20, 2018 at 16:45

1 Comment

Anton vBR Over a year ago

Ohhh you... I was looking for this. I knew it ... (was laboring with np.where).

Attack68 · Accepted Answer · 2018-10-20 18:09:27Z

To throw in a completely different method, you could obtain an index by iterating through your boolean array and afterwards use it as a map for your headers. I don't know how fast the header mapping is, but you can index the booleans quickly.

import time

import numpy as np
from numba import jit

# np.bool was removed in NumPy 1.24; the builtin bool is the equivalent dtype
bool_array = np.array([True, False], dtype=bool)
boolean_array = np.random.choice(bool_array, size=100000000)

@jit(nopython=True)
def reassign(boolean_array):
    # b[i] is the position of the most recent True at or before i (0 if none yet)
    b = np.zeros(shape=(len(boolean_array),), dtype=np.int32)
    for i in range(1, len(boolean_array)):
        if boolean_array[i]:
            b[i] = i
        else:
            b[i] = b[i-1]
    return b

# the first call also includes Numba's compilation time
start = time.time()
print(reassign(boolean_array))
print("took {} seconds".format(time.time()-start))

Takes 0.5 seconds with Numba and 130 seconds without, for 100 million elements.
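
For the header-mapping step that is left open above, one option is plain NumPy fancy indexing. The sketch below is not the Numba loop itself: it swaps in np.maximum.accumulate, which should give the same position array as reassign without a Python-level loop, and then uses that array to look up the labels. The sample arrays are taken from the question's data, and I have not timed this at the 100 million scale.

```python
import numpy as np

# Sample labels and flags from the question's data (assumed).
col1 = np.array(["heading1", "abc", "efg", "hij", "heading2", "klm"])
is_heading = np.array([True, False, False, False, True, False])

# Own position for heading rows, 0 everywhere else.
positions = np.where(is_heading, np.arange(len(is_heading)), 0)

# The running maximum carries the latest heading position forward.
last_heading = np.maximum.accumulate(positions)

# Use the positions as a lookup ("map") into the labels.
parent = col1[last_heading]
print(parent)
```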
