5

Wanting to do something similar to this, but complete row aggregation even when nulls are present without including them.

import pandas as pd
import numpy as np

df = pd.DataFrame(data= {'Subject': ['X', 'G', 'H', 'M'],
                         'Col1': ['cat', 'dog', np.nan, 'horse'],
                         'Col2': [np.nan, 'black', 'brown', 'grey'],
                         'Col3': ['small', 'medium', 'large', 'large']})

df['Col4'] = df['Col1'] + ', ' + df['Col2'] + ', ' + df['Col3']

For clarification, this is the resulting dataframe I am looking for

  Subject   Col1   Col2    Col3                Col4
0       X    cat    NaN   small          cat, small
1       G    dog  black  medium  dog, black, medium
2       H    NaN  brown   large        brown, large
3       M  horse   grey   large  horse, grey, large
0

3 Answers 3

14

You could use apply, dropna and join to the column axis:

df['Col4'] = df[['Col1', 'Col2', 'Col3']].apply(lambda x: ','.join(x.dropna()), axis=1)

It gives as expected:

  Subject   Col1   Col2    Col3              Col4
0       X    cat    NaN   small         cat,small
1       G    dog  black  medium  dog,black,medium
2       H    NaN  brown   large       brown,large
3       M  horse   grey   large  horse,grey,large

It should be more or less 30% faster than @yatu's way for small dataframes like this one, but the other way is better for larger ones.

Sign up to request clarification or add additional context in comments.

3 Comments

Oh nice, I'm not sure why I hadn't thought about this. Thank you :)
It doesn't scale as well though (see timings)
@yatu: You are right. I only tried timeit with OP's data and did not control how it would scale. I have warned future readers of it in the post.
6

One approach is to set_index and stack (which will remove missing values), groupby on the first level, and aggregate with str.join:

df['Col4'] = (df.set_index('Subject')
                .stack()
                .groupby(level=0, sort=False)
                .agg(', '.join)
                .values)

print(df)

  Subject   Col1   Col2    Col3                Col4
0       X    cat    NaN   small          cat, small
1       G    dog  black  medium  dog, black, medium
2       H    NaN  brown   large        brown, large
3       M  horse   grey   large  horse, grey, large

Timings -

df_ = pd.concat([df]*1000, axis=0).reset_index(drop=True)

%timeit df_[['Col1', 'Col2', 'Col3']].apply(lambda x: ','.join(x.dropna()), axis=1)
# 743 ms ± 17.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit (df_.set_index('Subject').stack().groupby(level=0, sort=False).agg(', '.join).values)
# 5.73 ms ± 168 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

2 Comments

Thanks for the clarity on this, @yatu. I have selected Serge's answer based on it's simplicity and performance on my particular data set. Keeping yours in reference for the future if this ever changes. Thank you!
Hi @yatu, could you help to explain this solution a little? why was "Subject" selected for set_index and what if I only want to join Col1 and Col3 but not Col3?
0

The following version of yatu's answer is to answer @annena's question:

why was "Subject" selected for set_index and what if I only want to join Col1 and Col3 but not Col3?

def join_columns(df: pd.DataFrame, cols: list[str], join_str: str = '; ') -> pd.Series:
    df_cp = df.copy()
    at_least_one_col_populated = df_cp[cols].notnull().any(axis=1)
    df_cp.loc[at_least_one_col_populated, 'return_col'] = df_cp[cols].stack().groupby(level=0, sort=False).agg(join_str.join).values
    return df_cp['return_col']

df = pd.DataFrame({
    'col1': ['1', '1', None, '1', None],
    'col2': [None, None, None, None, None],
    'col3': ['2', '2', '2', None, None],
})

df['joined'] = join_columns(df, ['col1', 'col3'])
df

enter image description here'

To answer with words instead of code: "Subject" was passed to set_index() because it was unique, I think. Which is not necessary in my function because I filter out rows that are all blank. And you could have specified Col1 and Col3 as dataframe slices, which is what my function does.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.