I have a pandas DataFrame (20 columns x 1e6 rows) with several name fields ['PREFIX', 'FIRST_NAME', 'MIDDLE_NAME', 'LAST_NAME', 'SUFFIX'] that I am trying to concatenate into a single field, 'FULLNAME'. The name fields often have whitespace at the beginning or end of the string, and many records have fields that are empty (e.g. SUFFIX = '').
Other answers suggest adding the fields as usual:
df['FULLNAME'] = (df['PREFIX'].str.strip() + ' ' + df['FIRST_NAME'].str.strip() + ' ' +
                  df['MIDDLE_NAME'].str.strip() + ' ' + df['LAST_NAME'].str.strip() + ' ' +
                  df['SUFFIX'].str.strip())
The only problem here is that if a field is empty, I end up with a double-space in its place.
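To make the problem concrete, here is a small sketch on toy data. The `sample` frame and its values are made up for illustration and are not my real data; I am also assuming every field is a string (no NaN).

```python
import pandas as pd

# Toy data (hypothetical); the second row has an empty PREFIX, MIDDLE_NAME and SUFFIX.
sample = pd.DataFrame({
    'PREFIX': [' Dr. ', ''],
    'FIRST_NAME': ['Jane ', ' John'],
    'MIDDLE_NAME': ['Q', ''],
    'LAST_NAME': [' Doe', 'Smith '],
    'SUFFIX': ['', ''],
})
cols = ['PREFIX', 'FIRST_NAME', 'MIDDLE_NAME', 'LAST_NAME', 'SUFFIX']

# Strip each field, then add the fields together with a single space in between.
stripped = [sample[c].str.strip() for c in cols]
naive = stripped[0]
for s in stripped[1:]:
    naive = naive + ' ' + s

# Empty fields leave stray leading, trailing, or doubled spaces behind.
print(naive.tolist())
```

The second row comes out with a leading space, a double space where the middle name should be, and a trailing space.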
My (long-winded) solution is the following:
df['FULLNAME'] = df[['PREFIX', 'FIRST_NAME', 'MIDDLE_NAME', 'LAST_NAME', 'SUFFIX']].apply(
    lambda x: ' '.join(' '.join(item.strip() for item in x).split()), axis=1)
This solution works, but it is relatively inefficient given that I have over a million rows. Is there a more efficient operation I can use here? I suppose I could add the fields as in the first example, and then replace any double spaces with a single space:
df['FULLNAME'] = df['FULLNAME'].str.replace('  ', ' ')
However, that may not cover every case: I do not know how many of the name fields may be blank for a given row, so several adjacent empty fields could leave runs of three or more spaces, and leading or trailing spaces would remain.
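One variant I could imagine is collapsing any run of whitespace with a regular expression rather than replacing a fixed double space. Below is a minimal sketch on made-up toy data (the `sample` frame is hypothetical). It assumes every field is a string with no NaN values, and I have not timed it against the `apply` version on the full data.

```python
import pandas as pd

# Toy data (hypothetical); the second row has an empty PREFIX, MIDDLE_NAME and SUFFIX.
sample = pd.DataFrame({
    'PREFIX': [' Dr. ', ''],
    'FIRST_NAME': ['Jane ', ' John'],
    'MIDDLE_NAME': ['Q', ''],
    'LAST_NAME': [' Doe', 'Smith '],
    'SUFFIX': ['', ''],
})
cols = ['PREFIX', 'FIRST_NAME', 'MIDDLE_NAME', 'LAST_NAME', 'SUFFIX']

# Join the raw columns with a single space using the vectorized Series.str.cat.
joined = sample[cols[0]].str.cat([sample[c] for c in cols[1:]], sep=' ')

# Collapse every run of whitespace to one space, then trim both ends.
full = joined.str.replace(r'\s+', ' ', regex=True).str.strip()

print(full.tolist())
```

Like the `split()`/`join()` approach, this also collapses repeated whitespace inside a single field, which is acceptable for my data but may not be in general.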