
Suppose we have the following pandas DataFrame:

import numpy as np
import pandas as pd

asd = pd.DataFrame({'A': ['a', 'b', np.nan, 'c', np.nan],
                    'B': ['f', np.nan, 'u', 'i', np.nan]})

I want to concatenate the values in columns 'A' and 'B', separated by a comma ', ', and put the result into a new column asd['C'] when both are notnull(). If only one is null, return the other one; if both are null, return np.nan. So the final outcome for column 'C' would be:

asd['C'] = ['a, f', 'b', 'u', 'c, i', np.nan]

I tried the following:

def f(asd):
    if asd['A'].notnull() & asd['B'].notnull():
        asd['C'] = asd['A'] + ', ' + asd['B']
    elif asd['A'].notnull() & asd['B'].isnull():
        asd['C'] = asd['A']
    elif asd['A'].isnull() & asd['B'].notnull():
        asd['C'] = asd['B']
    else:
        asd['C'] = np.nan
    return asd['C']

asd['C'] = asd.apply(f, axis=1)

but it gives me the following error:

("'str' object has no attribute 'notnull'", 'occurred at index 0')

Any help is really appreciated.

2 Answers


Use apply + str.join:

df.apply(lambda x: ', '.join(x.dropna()), axis=1).replace('', np.nan)

0    a, f
1       b
2       u
3    c, i
4     NaN
dtype: object

The final replace call handles your np.nan requirement.
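
As an aside, the error in the question arises because with axis=1, apply passes each row to f as a Series, so asd['A'] inside f is a scalar string, and plain strings have no .notnull method. For reference, a sketch of a corrected row-wise version (slower than the one-liner above) could use pd.notnull, which works on scalars:

import numpy as np
import pandas as pd

asd = pd.DataFrame({'A': ['a', 'b', np.nan, 'c', np.nan],
                    'B': ['f', np.nan, 'u', 'i', np.nan]})

def f(row):
    # with axis=1, each row arrives as a Series, so row['A'] and row['B']
    # are scalars; pd.notnull works on scalars, unlike the Series method
    if pd.notnull(row['A']) and pd.notnull(row['B']):
        return row['A'] + ', ' + row['B']
    elif pd.notnull(row['A']):
        return row['A']
    elif pd.notnull(row['B']):
        return row['B']
    return np.nan

asd['C'] = asd.apply(f, axis=1)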


4 Comments

It worked perfectly, thank you again and again @COLDSPEED, you are saving me a lot of time.
@MartinHeusen No worries. I am just a little disappointed pandas doesn't inherently support str methods across subslices, or this could've been sped up a lot.
Yeah, I know; apply() always slows things down, especially when dealing with DataFrames with tens of millions of rows, like the problem I'm tackling.
@cᴏʟᴅsᴘᴇᴇᴅ I added a new way, since np.nan creates a lot of problems here.

I think you can do it this way:

df['C'] = df.stack().groupby(level=0).apply(','.join)
df
Out[459]: 
     A    B    C
0    a    f  a,f
1    b  NaN    b
2  NaN    u    u
3    c    i  c,i
4  NaN  NaN  NaN
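
The reason this works: stack() drops NaN by default (at least in the pandas versions current when this was written), so each level-0 group holds only that row's non-null values, and the all-NaN row 4 disappears entirely; assigning the result back then leaves NaN there via index alignment. A quick sketch of the intermediate, using the same frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', np.nan, 'c', np.nan],
                   'B': ['f', np.nan, 'u', 'i', np.nan]})

# stack() drops NaN and keys the surviving values by the original
# row index (level 0); row 4 vanishes completely
print(df.stack())
# 0  A    a
#    B    f
# 1  A    b
# 2  B    u
# 3  A    c
#    B    i
# dtype: object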

Adding timings:

Small data set:

%timeit df.apply(lambda x: ', '.join(x.dropna()), 1).replace('', np.nan)
1000 loops, best of 3: 1.6 ms per loop
%timeit df.stack().groupby(level=0).apply(','.join)
1000 loops, best of 3: 1.41 ms per loop

Large data set (both are slow):

df=pd.concat([df]*1000,axis=1)
df=pd.concat([df]*1000,axis=0)
%timeit df.apply(lambda x: ', '.join(x.dropna()), 1).replace('', np.nan)
1 loop, best of 3: 2.1 s per loop
%timeit df.stack().groupby(level=0).apply(','.join)
1 loop, best of 3: 1.23 s per loop
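
Since both approaches slow down at scale, a fully vectorized alternative (my own sketch, not from either answer, assuming both columns hold strings or NaN) is Series.str.cat, which avoids apply entirely:

import numpy as np
import pandas as pd

asd = pd.DataFrame({'A': ['a', 'b', np.nan, 'c', np.nan],
                    'B': ['f', np.nan, 'u', 'i', np.nan]})

# join with na_rep='' so NaNs don't poison the result, then strip the
# stray separator left behind by a missing side and turn the all-empty
# rows back into NaN
joined = asd['A'].str.cat(asd['B'], sep=', ', na_rep='')
asd['C'] = joined.str.strip(', ').replace('', np.nan)

This only covers the two-column case, but it stays in vectorized string operations end to end.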

3 Comments

I can see this as a good alternative. I am not sure about it speedwise... it would be interesting to see timings here.
@cᴏʟᴅsᴘᴇᴇᴅ added timings ~ :-)
Thanks! I'd say the slow speed of the second is due to the stack operation. But good one nonetheless.
