Concatenating multiple DataFrame columns and removing multiple spaces

Question

I have a pandas DataFrame (20 x 1e6) with several name fields ['PREFIX', 'FIRST_NAME', 'MIDDLE_NAME', 'LAST_NAME', 'SUFFIX'] that I am trying to concatenate into a single field, 'FULLNAME'. The name fields often have whitespace at the beginning or end of the string, and furthermore many records have fields that are empty (ex. suffix = '').

Other answers suggest adding the fields as usual:

df['FULLNAME'] = (
    df['PREFIX'].str.strip() + ' ' + df['FIRST_NAME'].str.strip() + ' '
    + df['MIDDLE_NAME'].str.strip() + ' ' + df['LAST_NAME'].str.strip() + ' '
    + df['SUFFIX'].str.strip()
)

The only problem here is that if a field is empty, I end up with a double-space in its place.

My (longwinded) solution is the following:

df['FULLNAME'] = df[['PREFIX', 'FIRST_NAME', 'MIDDLE_NAME', 'LAST_NAME', 'SUFFIX']].apply(
    lambda x: ' '.join(' '.join(item.strip() for item in x).split()), axis=1)

This solution works, but it is relatively inefficient given that I have over a million rows. Is there a more efficient operation I can use here? I suppose I could add the fields as in the first example and then replace any run of spaces:

df['FULLNAME'] =  df['FULLNAME'].str.replace('  ', ' ')

However, that may not be a complete solution, since I do not know how many of the name fields may be blank for a given row.
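A minimal sketch of why a single literal replacement falls short, using a made-up joined value (the sample string is an assumption for illustration, not data from the question). When two adjacent fields are empty, the join leaves three consecutive spaces, and one pass of replacing two spaces with one still leaves a double space.

```python
import pandas as pd

# Hypothetical row: prefix and last name present, the other three fields empty.
joined = pd.Series([' '.join(['Dr.', '', '', 'Smith', ''])])

# One literal pass turns each non-overlapping pair of spaces into one space,
# so a run of three spaces shrinks to two rather than one.
single_pass = joined.str.replace('  ', ' ', regex=False)

print(repr(joined.iloc[0]))       # 'Dr.   Smith '
print(repr(single_pass.iloc[0]))  # 'Dr.  Smith '
```

The trailing space in both values comes from the empty final field, which a literal replacement does not address either.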

cs95 · Accepted Answer · 2018-08-21 19:13:09Z

It's easier to aggregate your columns with agg and then just remove the extras later, using str.replace.

name_cols = ['PREFIX', 'FIRST_NAME', 'MIDDLE_NAME', 'LAST_NAME', 'SUFFIX']
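# regex=True is required on pandas 2.0+, where str.replace treats the pattern literally by default
df['FULLNAME'] = df[name_cols].agg(' '.join, axis=1).str.replace(r'\s+', ' ', regex=True)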
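Here is a small end-to-end sketch of this approach on made-up rows (the sample data is an assumption, not taken from the question). It also appends a final str.strip(), which is an addition to the snippet above: without it, a row whose first or last field is empty keeps a leading or trailing space.

```python
import pandas as pd

name_cols = ['PREFIX', 'FIRST_NAME', 'MIDDLE_NAME', 'LAST_NAME', 'SUFFIX']

# Made-up sample rows with padded and empty fields.
df = pd.DataFrame({
    'PREFIX': ['Dr. ', ''],
    'FIRST_NAME': [' Jane', 'John '],
    'MIDDLE_NAME': ['', ' Q. '],
    'LAST_NAME': ['Smith ', 'Public'],
    'SUFFIX': ['', ' Jr.'],
})

df['FULLNAME'] = (
    df[name_cols]
    .agg(' '.join, axis=1)                  # join the five fields of each row
    .str.replace(r'\s+', ' ', regex=True)   # collapse any run of whitespace
    .str.strip()                            # drop leading/trailing space
)

print(df['FULLNAME'].tolist())
```

Because the regex collapses runs of any length, there is no need to strip each column individually before joining.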


2 Comments

Le Chase Over a year ago

Thanks! Is there any advantage to using agg over apply in this situation?

cs95 Over a year ago

@LeChase - agg is a little more optimised than apply in this situation. They both end up doing the same thing, but agg is supposed to return a Series in any case.
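As a quick check of the "same thing" part of this comment, the sketch below runs both calls on a made-up two-column frame (the column subset and values are assumptions for illustration) and compares the results. It does not measure speed; timing both on your own data, for example with timeit, would be the way to test the performance claim.

```python
import pandas as pd

# Made-up sample with one empty field.
df = pd.DataFrame({
    'FIRST_NAME': ['Jane', 'John'],
    'LAST_NAME': ['Smith', ''],
})

via_agg = df.agg(' '.join, axis=1)
via_apply = df.apply(' '.join, axis=1)

# Both calls pass each row to ' '.join and return a Series of strings.
print(via_agg.equals(via_apply))
```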
