I have the following sample DataFrame df, built from the dict d, with two columns, 'col1' and 'col2'. I would like to get the list of unique names across the whole DataFrame.
import pandas as pd
d = {'col1':['Pat, Joseph',
'Tony, Hoffman',
'Miriam, Goodwin',
'Roxanne, Padilla',
'Julie, Davis',
'Muriel, Howell',
'Salvador, Reese',
'Kristopher, Mckenzie',
'Lucille, Thornton',
'Brenda, Wilkerson'],
'col2':['Kristopher, Mckenzie',
'Lucille, Thornton',
'Pete, Fitzgerald; Cecelia, Bass; Julie, Davis',
'Muriel, Howell', 'Harriet, Phillips',
'Belinda, Drake;David, Ford', 'Jared, Cummings;Joanna, Burns;Bob, Cunningham',
'Keith, Hernandez;Pat, Joseph', 'Kristopher, Mckenzie', 'Lucille, Thornton']}
df = pd.DataFrame(data=d)
For column col1 I can get this with unique():
df.col1.unique()
array(['Pat, Joseph', 'Tony, Hoffman', 'Miriam, Goodwin',
'Roxanne, Padilla', 'Julie, Davis', 'Muriel, Howell',
'Salvador, Reese', 'Kristopher, Mckenzie', 'Lucille, Thornton',
'Brenda, Wilkerson'], dtype=object)
len(df.col1)
10 # total number of rows
len(df.col1.unique())
10 # total number of unique names (all ten are distinct in this sample)
In col2, some rows contain multiple names separated by semicolons, e.g. 'Pete, Fitzgerald; Cecelia, Bass; Julie, Davis'. The spacing after the semicolon is inconsistent, too: compare 'Belinda, Drake;David, Ford'.
How can I get the unique names from col2 using a vectorized operation? I am trying to avoid a for loop because the actual data set is large.
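To make the desired result concrete, here is a minimal sketch of the kind of loop-free approach I have in mind. It assumes a pandas version that provides Series.explode (0.25 or later) and uses a shortened copy of the sample data. The helper name unique_names is made up for illustration, and I have not checked how well this scales to the real data set.

```python
import pandas as pd

# Shortened version of the sample data, for illustration only.
df = pd.DataFrame({
    "col1": ["Pat, Joseph", "Julie, Davis", "Pat, Joseph"],
    "col2": ["Pete, Fitzgerald; Cecelia, Bass; Julie, Davis",
             "Keith, Hernandez;Pat, Joseph",
             "Muriel, Howell"],
})

def unique_names(series):
    """Hypothetical helper: split on ';', flatten, trim, deduplicate."""
    return (
        series.str.split(";")  # each cell becomes a list of names
        .explode()             # one name per row
        .str.strip()           # drop the optional space after ';'
        .dropna()              # ignore missing cells, if any
        .unique()              # keeps order of first appearance
    )

# Unique names in col2 only.
col2_unique = unique_names(df["col2"])

# Unique names across the whole DataFrame: stack both columns first.
all_unique = unique_names(pd.concat([df["col1"], df["col2"]], ignore_index=True))
```

On the full sample, the same helper could be called as unique_names(df.col2). It avoids an explicit Python loop, although the split and explode steps still build intermediate objects.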