
Suppose we have the following pandas DataFrame:

import numpy as np
import pandas as pd

asd = pd.DataFrame({'A': ['a', 'b', np.nan, 'c', np.nan],
                    'B': ['f', np.nan, 'u', 'i', np.nan]})

I want to concatenate the values in columns 'A' and 'B', separated by a comma ', ', and put the result into a new column asd['C'] when both are notnull(). If only one is null, return the other one; if both are null, return np.nan. So the final outcome for column 'C' would be:

asd['C'] = ['a, f', 'b', 'u', 'c, i', np.nan]

I tried the following:

def f(asd):
    if asd['A'].notnull() & asd['B'].notnull():
        asd['C'] = asd['A'] + ', ' + asd['B']
    elif asd['A'].notnull() & asd['B'].isnull():
        asd['C'] = asd['A']
    elif asd['A'].isnull() & asd['B'].notnull():
        asd['C'] = asd['B']
    else:
        asd['C'] = np.nan
    return asd['C']

asd['C'] = asd.apply(f, axis=1)

but it gives me the following error:

("'str' object has no attribute 'notnull'", 'occurred at index 0')

Any help is really appreciated.

2 Answers


Use apply + str.join:

df.apply(lambda x: ', '.join(x.dropna()), axis=1).replace('', np.nan)

0    a, f
1       b
2       u
3    c, i
4     NaN
dtype: object

The final replace call handles your np.nan requirement.
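
As an aside, the error in the question arises because with axis=1, apply passes each row to f as a Series, so asd['A'] inside f is a scalar string, and plain strings have no .notnull method. For reference, a sketch of a corrected row-wise version (slower than the one-liner above) could use pd.notnull, which works on scalars:

import numpy as np
import pandas as pd

asd = pd.DataFrame({'A': ['a', 'b', np.nan, 'c', np.nan],
                    'B': ['f', np.nan, 'u', 'i', np.nan]})

def f(row):
    # with axis=1, each row arrives as a Series, so row['A'] and row['B']
    # are scalars; pd.notnull works on scalars, unlike the Series method
    if pd.notnull(row['A']) and pd.notnull(row['B']):
        return row['A'] + ', ' + row['B']
    elif pd.notnull(row['A']):
        return row['A']
    elif pd.notnull(row['B']):
        return row['B']
    return np.nan

asd['C'] = asd.apply(f, axis=1)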


4 Comments

It worked perfectly, thank you again and again @COLDSPEED, you are saving me a lot of time.
@MartinHeusen No worries. I am just a little disappointed pandas doesn't inherently support str methods across subslices, or this could've been sped up a lot.
Yeah, I know; apply() always slows things down, especially when dealing with DataFrames with tens of millions of rows, like the problem I'm tackling.
@cᴏʟᴅsᴘᴇᴇᴅ I added a new way, since np.nan creates a lot of problems here.

I think you can do it this way:

df['C'] = df.stack().groupby(level=0).apply(','.join)
df
Out[459]: 
     A    B    C
0    a    f  a,f
1    b  NaN    b
2  NaN    u    u
3    c    i  c,i
4  NaN  NaN  NaN
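
The reason this works: stack() drops NaN by default (at least in the pandas versions current when this was written), so each level-0 group holds only that row's non-null values, and the all-NaN row 4 disappears entirely; assigning the result back then leaves NaN there via index alignment. A quick sketch of the intermediate, using the same frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', np.nan, 'c', np.nan],
                   'B': ['f', np.nan, 'u', 'i', np.nan]})

# stack() drops NaN and keys the surviving values by the original
# row index (level 0); row 4 vanishes completely
print(df.stack())
# 0  A    a
#    B    f
# 1  A    b
# 2  B    u
# 3  A    c
#    B    i
# dtype: object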

Adding timings:

Small data set:

%timeit df.apply(lambda x: ', '.join(x.dropna()), 1).replace('', np.nan)
1000 loops, best of 3: 1.6 ms per loop
%timeit df.stack().groupby(level=0).apply(','.join)
1000 loops, best of 3: 1.41 ms per loop

Large data set (both are slow):

df=pd.concat([df]*1000,axis=1)
df=pd.concat([df]*1000,axis=0)
%timeit df.apply(lambda x: ', '.join(x.dropna()), 1).replace('', np.nan)
1 loop, best of 3: 2.1 s per loop
%timeit df.stack().groupby(level=0).apply(','.join)
1 loop, best of 3: 1.23 s per loop
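
Since both approaches slow down at scale, a fully vectorized alternative (my own sketch, not from either answer, assuming both columns hold strings or NaN) is Series.str.cat, which avoids apply entirely:

import numpy as np
import pandas as pd

asd = pd.DataFrame({'A': ['a', 'b', np.nan, 'c', np.nan],
                    'B': ['f', np.nan, 'u', 'i', np.nan]})

# join with na_rep='' so NaNs don't poison the result, then strip the
# stray separator left behind by a missing side and turn the all-empty
# rows back into NaN
joined = asd['A'].str.cat(asd['B'], sep=', ', na_rep='')
asd['C'] = joined.str.strip(', ').replace('', np.nan)

This only covers the two-column case, but it stays in vectorized string operations end to end.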

3 Comments

I can see this as a good alternative. I am not sure about it speedwise... it would be interesting to see timings here.
@cᴏʟᴅsᴘᴇᴇᴅ added timings ~ :-)
Thanks! I'd say the slow speed of the second is due to the stack operation. But good one nonetheless.
