6

I have a pandas dataframe:

|  col1  | heading |
|--------|---------|
|heading1|   true  |
|abc     |  false  |
|efg     |  false  |
|hij     |  false  |
|heading2|   true  |
|klm     |  false  |
|...     |  false  |

This data is actually "sequential" and I would like to transform it to this structure:

|  col1  |  Parent   |
|---------------------
|heading1|  heading1 |
|abc     |  heading1 | 
|efg     |  heading1 |
|hij     |  heading1 |
|heading2|  heading2 |
|klm     |  heading2 |
|...     |  headingN |

I have +10M rows so this method takes too long:

df['Parent'] = df['col1']

for index, row in df.iterrows():
    if row['heading']:
        current = row['col1']
    else:
        row.loc[index, 'Parent'] = current

Do you have any advice on a faster process?

1
  • How about using a fill function? Commented Oct 20, 2018 at 16:22

5 Answers 5

5

You can use a mask with ffill:

df.assign(heading=df.col1.mask(~df.col1.str.startswith('heading')).ffill())

       col1   heading
0  heading1  heading1
1       abc  heading1
2       efg  heading1
3       hij  heading1
4  heading2  heading2
5       klm  heading2

This works by replacing any value that does not start with heading with NaN, and then fills the last non-nan value forward:

df.col1.mask(~df.col1.str.startswith('heading'))

0    heading1
1         NaN
2         NaN
3         NaN
4    heading2
5         NaN
Name: col1, dtype: object

df.col1.mask(~df.col1.str.startswith('heading')).ffill()

0    heading1
1    heading1
2    heading1
3    heading1
4    heading2
5    heading2
Name: col1, dtype: object
Sign up to request clarification or add additional context in comments.

1 Comment

Yes, also had a ffill in mind. But this doesn't even use the column heading. Smart
3

Probably not a very pandas-idiomatic solution but you can cumsum the logical column and use that to grab the corresponding heading for each row. In essence we're defining a piecewise-constant index array that only gets incremented for each True value on the original heading column.

import pandas as pd

# set up some dummy data
df = pd.DataFrame({'heading': [True, False, False, False, True, False, False]},
                  index=['heading1', 'foo', 'bar', 'baz', 'heading2', 'quux', 'quuz'])

# get every 'heading' index
headings = df.index[df.heading]
# fetch which row corresponds to which 'heading'
indices = df.heading.cumsum() - 1
# fetch the actual headings for each row
df['parent'] = headings[indices]

print(df)

The output of the above code is

          heading    parent
heading1     True  heading1
foo         False  heading1
bar         False  heading1
baz         False  heading1
heading2     True  heading2
quux        False  heading2
quuz        False  heading2

From which you can drop the unnecessary heading column. Of course you can directly get the logical array you have and work with that:

headline = df.index.str.startswith('heading') # bool Series
headings = df.index[headline]
indices = df.heading.cumsum() - 1
df['parent'] = headings[indices]

Comments

2

I thought of a ffill too. By using df.pop() we make sure the column disappears too.

df['Parent'] = df['col1'].mul(df.pop('heading')).replace('',np.nan).ffill()

Full example

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'col1': ['heading1', 'abc', 'efg', 'hij', 'heading2', 'klm'],
    'heading': [True, False, False, False, True, False]
})

df['Parent'] = df['col1'].mul(df.pop('heading')).replace('',np.nan).ffill()
print(df)

Returns:

       col1    Parent
0  heading1  heading1
1       abc  heading1
2       efg  heading1
3       hij  heading1
4  heading2  heading2
5       klm  heading2

Comments

1

where + pop + ffill

You may find this more efficient. Data from @AntonvBR.

df['Parent'] = df['col1'].where(df.pop('heading')).ffill()

print(df)

       col1    Parent
0  heading1  heading1
1       abc  heading1
2       efg  heading1
3       hij  heading1
4  heading2  heading2
5       klm  heading2

1 Comment

Ohhh you... I was looking for this. I knew it ... (was laboring with np.where).
1

To throw in a completely different method, you could obtain an index by iterating through your boolean array and afterwards using it as a map for your headers. I don't know how fast the header mapping is but you can index the booleans quickly..

import numpy as np
from numba import jit
bool_array = np.array([True, False], dtype=np.bool)
boolean_array = np.random.choice(bool_array, size=100000000)
@jit(nopython=True)
def reassign(boolean_array):
    b = np.zeros(shape=(len(boolean_array),), dtype=np.int32)
    b[0] = 0
    for i in range(1,len(boolean_array)):
        if boolean_array[i]:
            b[i] = i
        else:
            b[i] = b[i-1]
    return b

import time
start = time.time()
print(reassign(boolean_array))
print("took {} seconds".format(time.time()-start))

Takes 0.5 seconds with Numba and 130 seconds without, for 100mm

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.