Concatenate strings across columns that are not null

Question

I want to do something similar to this, but have the row-wise concatenation complete even when nulls are present, without including the nulls in the result.

import pandas as pd
import numpy as np

df = pd.DataFrame(data= {'Subject': ['X', 'G', 'H', 'M'],
                         'Col1': ['cat', 'dog', np.nan, 'horse'],
                         'Col2': [np.nan, 'black', 'brown', 'grey'],
                         'Col3': ['small', 'medium', 'large', 'large']})

df['Col4'] = df['Col1'] + ', ' + df['Col2'] + ', ' + df['Col3']

For clarification, this is the resulting dataframe I am looking for:

  Subject   Col1   Col2    Col3                Col4
0       X    cat    NaN   small          cat, small
1       G    dog  black  medium  dog, black, medium
2       H    NaN  brown   large        brown, large
3       M  horse   grey   large  horse, grey, large

Serge Ballesta · Accepted Answer · 2020-03-17 15:46:25Z

14

You could use apply, dropna and join along the column axis (axis=1), so that each row is processed on its own:

df['Col4'] = df[['Col1', 'Col2', 'Col3']].apply(lambda x: ','.join(x.dropna()), axis=1)

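This gives the expected result, apart from the separator (use ', '.join instead of ','.join to get the comma-plus-space spacing shown in the question):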

  Subject   Col1   Col2    Col3              Col4
0       X    cat    NaN   small         cat,small
1       G    dog  black  medium  dog,black,medium
2       H    NaN  brown   large       brown,large
3       M  horse   grey   large  horse,grey,large

In my timings on the OP's data it was roughly 30% faster than @yatu's approach, so it is fine for small dataframes like this one, but @yatu's approach scales much better on larger ones.
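As a minimal sketch of the apply/dropna/join approach, here is the same idea with ', ' as the separator so that the spacing matches the output requested in the question. The helper name join_non_null is made up for illustration; it is not part of pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Subject': ['X', 'G', 'H', 'M'],
                   'Col1': ['cat', 'dog', np.nan, 'horse'],
                   'Col2': [np.nan, 'black', 'brown', 'grey'],
                   'Col3': ['small', 'medium', 'large', 'large']})


def join_non_null(row: pd.Series, sep: str = ', ') -> str:
    # Hypothetical helper: drop the missing entries of one row, then join the rest.
    return sep.join(row.dropna())


# axis=1 passes each row (restricted to the three columns) to the helper.
df['Col4'] = df[['Col1', 'Col2', 'Col3']].apply(join_non_null, axis=1)
print(df['Col4'].tolist())
# ['cat, small', 'dog, black, medium', 'brown, large', 'horse, grey, large']
```

A row in which all three columns are null would produce an empty string with this sketch, since joining an empty sequence yields ''.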

edited Mar 17, 2020 at 15:46

answered Mar 17, 2020 at 15:09


3 Comments

Trace R. Over a year ago

Oh nice, I'm not sure why I hadn't thought about this. Thank you :)

yatu Over a year ago

It doesn't scale as well though (see timings)

Serge Ballesta Over a year ago

@yatu: You are right. I only ran timeit on the OP's data and did not check how it would scale. I have added a warning for future readers to the post.

yatu · Answer · 2020-03-17 15:38:24Z

6

One approach is to set_index and stack (which will remove missing values), groupby on the first level, and aggregate with str.join:

df['Col4'] = (df.set_index('Subject')
                .stack()
                .groupby(level=0, sort=False)
                .agg(', '.join)
                .values)

print(df)

  Subject   Col1   Col2    Col3                Col4
0       X    cat    NaN   small          cat, small
1       G    dog  black  medium  dog, black, medium
2       H    NaN  brown   large        brown, large
3       M  horse   grey   large  horse, grey, large
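To see what each link in that chain contributes, the steps can be split into intermediate variables. This is only an illustrative breakdown of the approach above; the variable names are mine, and the explicit dropna() is an addition that is redundant where stack() already drops missing values and, as far as I know, guards against pandas versions in which stack() keeps them.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Subject': ['X', 'G', 'H', 'M'],
                   'Col1': ['cat', 'dog', np.nan, 'horse'],
                   'Col2': [np.nan, 'black', 'brown', 'grey'],
                   'Col3': ['small', 'medium', 'large', 'large']})

# Step 1: move Subject into the index so only Col1..Col3 remain as columns.
indexed = df.set_index('Subject')

# Step 2: stack the columns into one long Series indexed by (Subject, column name),
# then make sure no missing values are left.
stacked = indexed.stack().dropna()

# Step 3: group on the first index level (Subject) and join each group's strings.
# sort=False keeps the groups in their original order.
joined = stacked.groupby(level=0, sort=False).agg(', '.join)

print(stacked)
print(joined)

# Step 4: assign positionally; this relies on one group per row, in row order.
df['Col4'] = joined.values
```

Because the final assignment is positional, this depends on every Subject being unique and on every row having at least one non-null value.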

Timings -

df_ = pd.concat([df]*1000, axis=0).reset_index(drop=True)

%timeit df_[['Col1', 'Col2', 'Col3']].apply(lambda x: ','.join(x.dropna()), axis=1)
# 743 ms ± 17.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit (df_.set_index('Subject').stack().groupby(level=0, sort=False).agg(', '.join).values)
# 5.73 ms ± 168 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
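One caveat when reading these numbers: df_ repeats each Subject 1000 times, so grouping on the Subject index yields only four groups (one per distinct Subject) rather than one result per row, and the two statements are not computing the same thing. Below is a rough sketch of a like-for-like comparison, assuming we group on the unique row index instead and use the standard timeit module so it also runs outside IPython. The function names are illustrative, and the timings will vary with machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'Subject': ['X', 'G', 'H', 'M'],
                   'Col1': ['cat', 'dog', np.nan, 'horse'],
                   'Col2': [np.nan, 'black', 'brown', 'grey'],
                   'Col3': ['small', 'medium', 'large', 'large']})
df_ = pd.concat([df] * 1000, axis=0).reset_index(drop=True)
cols = ['Col1', 'Col2', 'Col3']


def with_apply(frame: pd.DataFrame) -> pd.Series:
    # Row-wise Python loop: one dropna/join call per row.
    return frame[cols].apply(lambda x: ', '.join(x.dropna()), axis=1)


def with_stack(frame: pd.DataFrame) -> pd.Series:
    # Group on the unique row index, so each row yields exactly one result.
    return frame[cols].stack().dropna().groupby(level=0, sort=False).agg(', '.join)


# Grouping on the duplicated Subject values collapses 4000 rows into 4 groups.
by_subject = (df_.set_index('Subject').stack().dropna()
              .groupby(level=0, sort=False).agg(', '.join))
print(len(by_subject), 'groups when grouping on Subject')

# The two row-wise variants agree on all 4000 rows.
same = with_apply(df_).tolist() == with_stack(df_).tolist()
print('results identical:', same)

for fn in (with_apply, with_stack):
    seconds = timeit.timeit(lambda: fn(df_), number=3) / 3
    print(f'{fn.__name__}: {seconds * 1000:.1f} ms per call')
```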

edited Mar 17, 2020 at 15:38

answered Mar 17, 2020 at 15:02


2 Comments

Trace R. Over a year ago

Thanks for the clarity on this, @yatu. I have selected Serge's answer based on its simplicity and performance on my particular data set. I am keeping yours for future reference in case that ever changes. Thank you!

annena Over a year ago

Hi @yatu, could you explain this solution a little? Why was "Subject" selected for set_index, and what if I only want to join Col1 and Col3 but not Col2?

Weston A. Greene · Answer · 2024-09-22 21:50:24Z

The following variant of yatu's answer addresses @annena's question:

Why was "Subject" selected for set_index, and what if I only want to join Col1 and Col3 but not Col2?

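def join_columns(df: pd.DataFrame, cols: list[str], join_str: str = '; ') -> pd.Series:
    # Work on a copy so the caller's dataframe is not modified.
    df_cp = df.copy()
    # Rows with at least one non-null value among the selected columns.
    at_least_one_col_populated = df_cp[cols].notnull().any(axis=1)
    # stack() drops the nulls here; grouping on the row index gives one joined string per populated row.
    joined = df_cp[cols].stack().groupby(level=0, sort=False).agg(join_str.join)
    df_cp.loc[at_least_one_col_populated, 'return_col'] = joined.values
    # Rows where every selected column is null stay NaN.
    return df_cp['return_col']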

df = pd.DataFrame({
    'col1': ['1', '1', None, '1', None],
    'col2': [None, None, None, None, None],
    'col3': ['2', '2', '2', None, None],
})

df['joined'] = join_columns(df, ['col1', 'col3'])
df


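To answer in words instead of code: I think "Subject" was passed to set_index() because it is unique per row; moving it into the index also keeps it out of the values that get stacked and joined. That step is not necessary in my function, because I group on the existing row index and only assign to rows that have at least one populated column. To join only Col1 and Col3, select just those columns from the dataframe before stacking, which is what my function does.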
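To make that last point concrete, here is a small sketch on the question's dataframe that joins only Col1 and Col3. It assumes the default unique row index; the column name Col1_Col3 is made up for illustration, and the dropna() and reindex() calls are my additions (the first to be explicit about removing nulls, the second so that any row with both columns null would stay aligned as NaN).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Subject': ['X', 'G', 'H', 'M'],
                   'Col1': ['cat', 'dog', np.nan, 'horse'],
                   'Col2': [np.nan, 'black', 'brown', 'grey'],
                   'Col3': ['small', 'medium', 'large', 'large']})

cols = ['Col1', 'Col3']  # Col2 and Subject are simply not selected

joined = (df[cols]
          .stack()                        # long Series indexed by (row label, column name)
          .dropna()                       # make sure no missing values remain
          .groupby(level=0, sort=False)   # one group per original row label
          .agg(', '.join)
          .reindex(df.index))             # restore rows that had nothing to join as NaN

df['Col1_Col3'] = joined
print(df['Col1_Col3'].tolist())
# ['cat, small', 'dog, medium', 'large', 'horse, large']
```

Since the assignment here aligns on the index rather than on position, no set_index('Subject') step is needed.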
