I have a DataFrame like this:

import pandas as pd

data = {'col1': ['A', 'B', 'B', 'A', 'B', 'C', 'B', 'B', 'B',
                 'A', 'C', 'A', 'B', 'C'],
        'col2': ['NaN', 'comment1', 'comment2', 'NaN', 'comment3', 'NaN',
                 'comment4', 'comment5', 'comment6',
                 'NaN', 'NaN', 'NaN', 'comment7', 'NaN']}

frame = pd.DataFrame(data)
frame

col1  col2
A     NaN
B     comment1
B     comment2
A     NaN
B     comment3
C     NaN
B     comment4
B     comment5
B     comment6
A     NaN
C     NaN
A     NaN
B     comment7
C     NaN

Each row with col1 == 'B' has a comment which will be a string. I need to aggregate the comments and fill the preceding row (where col1 != 'B') with the resulting aggregated string.

Any given row where col1 != 'B' could have none, one or many corresponding rows of comments (col1 == 'B'), which seems to be the crux of the problem. I can't just use fillna(method='bfill') etc., because that would copy only the next single comment rather than all of them.

I have looked into iterrows(), groupby() and while loops, and tried to build my own function, but I don't think I fully understand how all of those work.

Finished product should look like this:

col1    col2
A       comment1 + comment2
B       comment1
B       comment2
A       comment3
B       comment3
C       comment4 + comment5 + comment6
B       comment4
B       comment5
B       comment6
A       NaN
C       NaN
A       comment7
B       comment7
C       NaN

Eventually I will be dropping all rows where col1 == 'B', but for now I'd like to keep them for verification.

2 Answers

3

Here's one way using GroupBy with a custom grouper to concatenate the strings where col1 is B:

# True for every row that is not a comment row
where_a = frame.col1.ne('B')
# Each non-B row starts a new group; the B rows that follow share its id
g = where_a.cumsum()
# Join the comments of each group (groups without any B row are simply absent)
com = frame[frame.col1.eq('B')].groupby(g).col2.agg(lambda x: x.str.cat(sep=' + '))
# Look up each non-B row's group id; groups without comments stay NaN
frame.loc[where_a, 'col2'] = g[where_a].map(com)

print(frame)

   col1                            col2
0     A             comment1 + comment2
1     B                        comment1
2     B                        comment2
3     A                        comment3
4     B                        comment3
5     C  comment4 + comment5 + comment6
6     B                        comment4
7     B                        comment5
8     B                        comment6
9     A                             NaN
10    C                             NaN
11    A                        comment7
12    B                        comment7
13    C                             NaN
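
To see why a cumulative sum works as the grouper, it may help to look at the intermediate values on a smaller frame. This is only a minimal sketch: the frame `demo` and the names `not_b`, `group_id` and `joined` are made up for illustration and are not part of the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, smaller than the one in the question
demo = pd.DataFrame({
    'col1': ['A', 'B', 'B', 'C', 'A', 'B'],
    'col2': [np.nan, 'x', 'y', np.nan, np.nan, 'z'],
})

not_b = demo['col1'].ne('B')   # True on rows that should receive comments
group_id = not_b.cumsum()      # increases by one at every non-B row

print(group_id.tolist())       # [1, 1, 1, 2, 3, 3]

# Join the comments of the B rows within each group id
joined = demo.loc[~not_b, 'col2'].groupby(group_id[~not_b]).agg(' + '.join)
print(joined)
```

Group 2 (the 'C' row) has no B rows, so it does not appear in `joined` at all; that is why a lookup by group id is safer than assigning the joined values positionally.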

12 Comments

So, this did work for the example I gave, but not for the larger dataframe I am working with. I am getting the following ValueError: "Must have equal len keys and value when setting with an iterable".
This means that you have some col1 value which is neither A nor B @nomad10, or perhaps consecutive As? Let me check these cases later
So I'm assuming that any value other than B must be filled with the following comments? @nomad10
Yes, your assumption is correct. Tried the updated code but still got the same error.
Okay, should work now @nomad10. Make sure NaNs are proper NaNs, i.e. np.nan. A string 'NaN' isn't useful
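
As a rough illustration of the point about proper NaNs, the sketch below converts the string 'NaN' into np.nan. The Series `s` and the name `cleaned` are hypothetical stand-ins for the col2 column.

```python
import numpy as np
import pandas as pd

# Hypothetical column where missing values were typed as the string 'NaN'
s = pd.Series(['NaN', 'comment1', 'NaN'])
print(s.isna().tolist())        # the strings are not treated as missing

# Swap the placeholder strings for real missing values
cleaned = s.replace('NaN', np.nan)
print(cleaned.isna().tolist())
```

After the replacement, isna() and notna() behave as expected on that column.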
0
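from functools import reduce

# Assumes df is the question's frame; start a new group at every non-B row
df['col_group'] = -1
col_group = 0
for i in df.index:
    if df.loc[i, 'col1'] != 'B':
        col_group += 1
    df.loc[i, 'col_group'] = col_group

# Split the comment rows from the rows that should receive the comments
comments = df[df['col1'] == 'B']
transactions = df[df['col1'] != 'B']
# Join each group's comments with the "&$#" separator
agg_comments = comments.groupby('col_group')['col2'].apply(lambda x: reduce(lambda i,j: i+"&$#"+j,x)).reset_index()
# After the merge the original col2 becomes col2_x and the joined comments col2_y
df = pd.merge(transactions, agg_comments, on='col_group', how='outer')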

1 Comment

So, this does work but it is extremely slow for hundreds of thousands to 1 million rows. Does anyone have any suggestions on how this could be sped up?
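
One possible way to speed this up is to replace the Python loop with a cumulative sum, which should yield the same group ids in a single vectorized pass. This is a minimal sketch, not a benchmarked solution: the small frame and the names `is_comment`, `joined` and `result` are assumptions for illustration, and ' + ' is used as the separator (as in the question) instead of "&$#".

```python
import numpy as np
import pandas as pd

# Hypothetical small frame standing in for the large one
df = pd.DataFrame({
    'col1': ['A', 'B', 'B', 'A', 'C', 'B'],
    'col2': [np.nan, 'comment1', 'comment2', np.nan, np.nan, 'comment3'],
})

is_comment = df['col1'].eq('B')
# Same group ids as the loop: the counter goes up at every non-B row
df['col_group'] = (~is_comment).cumsum()

# Join the comments per group, then look them up for the non-B rows
joined = df.loc[is_comment].groupby('col_group')['col2'].agg(' + '.join)
result = df.loc[~is_comment, ['col1', 'col_group']].copy()
result['col2'] = result['col_group'].map(joined)
print(result)
```

Here `result` holds only the non-B rows, so the comment rows are already dropped; rows whose group has no comments keep NaN in col2.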
