I have the following sample DataFrame df, built from a dict d, with two columns, 'col1' and 'col2'. I would like to find the list of unique names across the whole DataFrame.

    import pandas as pd

    d = {'col1':['Pat, Joseph', 
                 'Tony, Hoffman', 
                 'Miriam, Goodwin', 
                 'Roxanne, Padilla',
                 'Julie, Davis', 
                 'Muriel, Howell', 
                 'Salvador, Reese', 
                 'Kristopher, Mckenzie',
                 'Lucille, Thornton', 
                 'Brenda, Wilkerson'],

     'col2':['Kristopher, Mckenzie', 
             'Lucille, Thornton',
             'Pete, Fitzgerald; Cecelia, Bass; Julie, Davis', 
             'Muriel, Howell', 'Harriet, Phillips',
             'Belinda, Drake;David, Ford', 'Jared, Cummings;Joanna, Burns;Bob, Cunningham',
             'Keith, Hernandez;Pat, Joseph', 'Kristopher, Mckenzie', 'Lucille, Thornton']}

    df = pd.DataFrame(data=d)

For column col1 I can get this done with the unique() method.

df.col1.unique()
array(['Pat, Joseph', 'Tony, Hoffman', 'Miriam, Goodwin',
       'Roxanne, Padilla', 'Julie, Davis', 'Muriel, Howell',
       'Salvador, Reese', 'Kristopher, Mckenzie', 'Lucille, Thornton',
       'Brenda, Wilkerson'], dtype=object)
len(df.col1)           # 10, total number of rows
len(df.col1.unique())  # 10, number of unique values (every name in col1 is distinct)

In col2, some rows contain multiple names separated by semicolons, e.g. 'Pete, Fitzgerald; Cecelia, Bass; Julie, Davis'.

How can I get the unique names from col2 using a vectorized operation? I am trying to avoid a for loop since the actual data set is large.

1 Answer

First split on ;\s* (a regex: a semicolon followed by zero or more whitespace characters) with expand=True to get a DataFrame, then reshape it into a Series with stack, and finally call unique:

print(df['col2'].str.split(r';\s*', expand=True).stack().unique())
['Kristopher, Mckenzie' 'Lucille, Thornton' 'Pete, Fitzgerald'
 'Cecelia, Bass' 'Julie, Davis' 'Muriel, Howell' 'Harriet, Phillips'
 'Belinda, Drake' 'David, Ford' 'Jared, Cummings' 'Joanna, Burns'
 'Bob, Cunningham' 'Keith, Hernandez' 'Pat, Joseph']
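
As a possible variant, the same split can be followed by Series.explode (available since pandas 0.25) instead of expand=True plus stack. This is only a sketch on a reduced sample shaped like col2; the Series name s is made up for the illustration. It avoids building the intermediate padded DataFrame and does not depend on how stack treats the None padding.

```python
import pandas as pd

# Reduced, illustrative sample in the same shape as the question's col2.
s = pd.Series(['Kristopher, Mckenzie',
               'Pete, Fitzgerald; Cecelia, Bass; Julie, Davis',
               'Belinda, Drake;David, Ford',
               'Kristopher, Mckenzie'])

# Split every cell into a list of names, then give each list element its own row.
names = s.str.split(r';\s*').explode()

# unique() keeps the order of first appearance.
unique_names = names.unique().tolist()
print(unique_names)
```

On the real data this would be df['col2'].str.split(r';\s*').explode().unique().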

Detail:

print(df['col2'].str.split(r';\s*', expand=True))
                      0               1                2
0  Kristopher, Mckenzie            None             None
1     Lucille, Thornton            None             None
2      Pete, Fitzgerald   Cecelia, Bass     Julie, Davis
3        Muriel, Howell            None             None
4     Harriet, Phillips            None             None
5        Belinda, Drake     David, Ford             None
6       Jared, Cummings   Joanna, Burns  Bob, Cunningham
7      Keith, Hernandez     Pat, Joseph             None
8  Kristopher, Mckenzie            None             None
9     Lucille, Thornton            None             None
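
The \s* part of the pattern matters for rows like row 2, where a space follows each semicolon. A small sketch comparing a plain ';' split with the regex split on that one cell (the variable names are just for illustration):

```python
import pandas as pd

# The one cell from col2 that has a space after each semicolon.
s = pd.Series(['Pete, Fitzgerald; Cecelia, Bass; Julie, Davis'])

# Splitting on a bare ';' keeps the leading space on the later names.
plain = s.str.split(';').iloc[0]

# Splitting on ';' plus optional whitespace removes it.
cleaned = s.str.split(r';\s*').iloc[0]

print(plain)
print(cleaned)
```

With the plain split, ' Julie, Davis' (leading space) and 'Julie, Davis' would count as two different names in unique.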

print(df['col2'].str.split(r';\s*', expand=True).stack())
0  0    Kristopher, Mckenzie
1  0       Lucille, Thornton
2  0        Pete, Fitzgerald
   1           Cecelia, Bass
   2            Julie, Davis
3  0          Muriel, Howell
4  0       Harriet, Phillips
5  0          Belinda, Drake
   1             David, Ford
6  0         Jared, Cummings
   1           Joanna, Burns
   2         Bob, Cunningham
7  0        Keith, Hernandez
   1             Pat, Joseph
8  0    Kristopher, Mckenzie
9  0       Lucille, Thornton
dtype: object

Alternative solution:

import numpy as np
print(np.unique(np.concatenate(df['col2'].str.split(r';\s*').values)))
['Belinda, Drake' 'Bob, Cunningham' 'Cecelia, Bass' 'David, Ford'
 'Harriet, Phillips' 'Jared, Cummings' 'Joanna, Burns' 'Julie, Davis'
 'Keith, Hernandez' 'Kristopher, Mckenzie' 'Lucille, Thornton'
 'Muriel, Howell' 'Pat, Joseph' 'Pete, Fitzgerald']
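
Note that np.unique returns the names sorted, while Series.unique keeps the order of first appearance. If the number of occurrences is also of interest, np.unique can return it with return_counts=True. A minimal sketch on a made-up three-row sample:

```python
import numpy as np
import pandas as pd

# Small illustrative sample; 'Muriel, Howell' appears twice.
s = pd.Series(['Muriel, Howell', 'Belinda, Drake;David, Ford', 'Muriel, Howell'])

# Split each cell into a list, then flatten all lists into one array.
parts = np.concatenate(s.str.split(r';\s*').values)

# Sorted unique names together with how often each one occurs.
names, counts = np.unique(parts, return_counts=True)
print(names.tolist())
print(counts.tolist())
```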

EDIT:

To get all unique names across the whole DataFrame, call stack first to turn all columns into a single Series:

print(df.stack().str.split(r';\s*', expand=True).stack().unique())

['Pat, Joseph' 'Kristopher, Mckenzie' 'Tony, Hoffman' 'Lucille, Thornton'
 'Miriam, Goodwin' 'Pete, Fitzgerald' 'Cecelia, Bass' 'Julie, Davis'
 'Roxanne, Padilla' 'Muriel, Howell' 'Harriet, Phillips' 'Belinda, Drake'
 'David, Ford' 'Salvador, Reese' 'Jared, Cummings' 'Joanna, Burns'
 'Bob, Cunningham' 'Keith, Hernandez' 'Brenda, Wilkerson']
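
If this is needed in more than one place, it could be wrapped in a small function. The sketch below is one possible shape, not part of the original solution: the name unique_names and its sep parameter are hypothetical, and it uses explode rather than the double stack. It flattens the frame row by row, so the order of first appearance matches the stack version.

```python
import pandas as pd

def unique_names(frame, sep=r';\s*'):
    """Hypothetical helper: unique names across all columns of frame."""
    # Flatten all cells row by row into one Series and drop missing cells.
    cells = pd.Series(frame.to_numpy().ravel()).dropna()
    # Split each cell, put every name on its own row, trim stray whitespace.
    names = cells.str.split(sep).explode().str.strip()
    return names.unique().tolist()

# Reduced, illustrative sample in the shape of the question's DataFrame.
sample = pd.DataFrame({
    'col1': ['Pat, Joseph', 'Julie, Davis'],
    'col2': ['Pete, Fitzgerald; Julie, Davis', 'Keith, Hernandez;Pat, Joseph'],
})

result = unique_names(sample)
print(result)
```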

2 Comments

df.col2.str.split(';',expand=True).stack().unique() was my solution when I got here :(. Argh, I should be faster next time.
Thank you for the quick solution. Since it was easy, I am going to ask you another one: how can I find all the unique names in the above DataFrame?
