I'm quite new to Python and I'm trying to use pandas (in an IPython Notebook, Python 3) to combine three columns. This is the original data:

       RegistrationID  FirstName  MiddleInitial   LastName    
           1              John       P             Smith    
           2              Bill       Missing       Jones   
           3              Paul       H             Henry  

And I'd like to have:

   RegistrationID FirstName MiddleInitial   LastName    FullName
     1              John       P             Smith   Smith, John, P 
     2              Bill       Missing       Jones   Jones, Bill 
     3              Paul       H             Henry   Henry, Paul, H 

I'm sure this is absolutely not the correct way of doing this, but this is how I have set it up so far in a for loop. Unfortunately, it just keeps going and going and never finishes.

%matplotlib inline
import pandas as pd

from IPython.core.display import HTML
css = open('style-table.css').read() + open('style-notebook.css').read()
HTML('<style>{}</style>'.format(css))

reg = pd.DataFrame.from_csv('regcontact.csv', index_col=RegistrationID)

for item, frame in regcombo['MiddleInitial'].iteritems():
    while frame == 'Missing':
        reg['FullName'] = reg.LastName.map(str) + ", " + reg.FirstName
    else:
        break

The idea is then to add another column for those with complete names (i.e. including MiddleInitial):

for item, frame in regcombo['MiddleInitial'].iteritems():
    while frame != 'Missing':
        reg['FullName1'] = reg.LastName.map(str) + ", " + reg.FirstName + ", " + reg.MiddleInitial
    else:
        break

And then combine them, so that there are no null values. I've looked everywhere, but I can't quite figure it out. Any help would be appreciated, and I apologize in advance if I have broken any conventions, as this is my first post.

2 Answers

This uses a list comprehension to create the new dataframe column, e.g. [(a, b, c) for a, b, c in some_iterable_item].

df['Full Name'] = [
   "{0}, {1} {2}"
   .format(last, first, middle if middle != 'Missing' else "").strip() 
   for last, first, middle 
   in df[['LastName', 'FirstName', 'MiddleInitial']].values]

>>> df
   RegistrationID FirstName MiddleInitial LastName      Full Name
0               1      John             P    Smith  Smith, John P
1               2      Bill       Missing    Jones    Jones, Bill
2               3      Paul             H    Henry  Henry, Paul H

The iterable_item is the array of values from the dataframe:

>>> df[['LastName', 'FirstName', 'MiddleInitial']].values
array([['Smith', 'John', 'P'],
       ['Jones', 'Bill', 'Missing'],
       ['Henry', 'Paul', 'H']], dtype=object)

So, per our list comprehension model:

>>> [(a, b, c) for (a, b, c) in df[['LastName', 'FirstName', 'MiddleInitial']].values]
[('Smith', 'John', 'P'), ('Jones', 'Bill', 'Missing'), ('Henry', 'Paul', 'H')]

I then format the string:

>>> a = "Smith"
>>> b = "John"
>>> c = "P"
>>> "{0}, {1} {2}".format(a, b, c)
'Smith, John P'

I use a ternary to check if the middle name is 'Missing', so:

middle if middle != "Missing" else ""

is equivalent to:

if middle == 'Missing':
    middle = ""

Finally, I added .strip() to remove the extra space in case the middle name is missing.
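
For reference, here is a self-contained sketch of the same list-comprehension idea, assuming the sample data from the question. Note that the question asked for a comma before the middle initial ("Smith, John, P"), so this variant joins the parts with ", " instead of using the format string above. The helper name full_name and its placeholder parameter are made up for illustration.

```python
import pandas as pd

# Sample data rebuilt from the question; 'Missing' is the placeholder used there.
df = pd.DataFrame({
    'RegistrationID': [1, 2, 3],
    'FirstName': ['John', 'Bill', 'Paul'],
    'MiddleInitial': ['P', 'Missing', 'H'],
    'LastName': ['Smith', 'Jones', 'Henry'],
})


def full_name(last, first, middle, placeholder='Missing'):
    """Hypothetical helper: join the name parts, skipping a placeholder initial."""
    parts = [last, first]
    if middle != placeholder:
        parts.append(middle)
    return ', '.join(parts)


# Same row-wise iteration over .values as in the answer above.
df['FullName'] = [
    full_name(last, first, middle)
    for last, first, middle
    in df[['LastName', 'FirstName', 'MiddleInitial']].values
]

print(df['FullName'].tolist())
```

Because the separator is only added when there is a middle initial to append, no .strip() is needed in this version.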

4 Comments

Many thanks for the comment on the "Missing" stuff - didn't notice that in the question.
This was incredibly helpful. Thank you!
Hi there - this wouldn't work if I wanted to check for a particular value in a column and then return a pre-defined string in another column (i.e. instead of a value in the array), correct? In that case I would need to revert to a loop like in my original example?
It would probably work fine with some modification, but you need to post a new question with a more specific example.

All you need to do is add the columns:

>>> df.FirstName + ', ' + df.LastName + ', ' + df.MiddleInitial
0          John, Smith, P
1    Bill, Jones, Missing
2          Paul, Henry, H
dtype: object

To add a new column, you could just write:

df['FullName'] = df.FirstName + ', ' + ...

(In pandas, it is usually best to avoid explicit loops and work on whole columns at once.)
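
Plain column addition keeps the 'Missing' placeholder in the result. One possible way to drop it while staying loop-free is to build the ", <initial>" suffix separately and blank it out with Series.where wherever the initial is the placeholder. This is a minimal sketch, assuming the sample data and the "LastName, FirstName, MiddleInitial" order from the question; the variable names are illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question; 'Missing' is the placeholder used there.
df = pd.DataFrame({
    'RegistrationID': [1, 2, 3],
    'FirstName': ['John', 'Bill', 'Paul'],
    'MiddleInitial': ['P', 'Missing', 'H'],
    'LastName': ['Smith', 'Jones', 'Henry'],
})

# The "Last, First" part is present for every row.
base = df['LastName'] + ', ' + df['FirstName']

# Keep ", <initial>" only where a real initial exists; use '' elsewhere.
has_middle = df['MiddleInitial'] != 'Missing'
suffix = (', ' + df['MiddleInitial']).where(has_middle, '')

df['FullName'] = base + suffix

print(df['FullName'].tolist())
```

Series.where keeps the values where the condition is True and substitutes the second argument elsewhere, so no separate string replacement step is needed afterwards.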

2 Comments

The timing benefit is marginal once you add logic to remove 'Missing' from the middle name. You'll need something like df.FullName.str.replace(', Missing', "")
Thanks, @Alexander - didn't notice the part about Missing. Appreciated!
