2

I have a df that has one column in which the values are lists of values.

My intent is to split this column using some technique from here: Pandas split column of lists into multiple columns

However, for the column names I want to use each unique value from those lists of values.

To retrieve the unique values I have tried three different methods. Each one has failed with a different reason.

Is there a way to get Series.unique() when the values are a list of values?

My three attempts, with associated tracebacks:

1)
unique_vals = splitted_interests.unique()

Traceback (most recent call last):
  File "C:/Users/Mark/PycharmProjects/main/main.py", line 68, in <module>
    unique_vals = splitted_interests.unique()
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\series.py", line 1991, in unique
    result = super().unique()
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\base.py", line 1405, in unique
    result = unique1d(values)
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\algorithms.py", line 405, in unique
    uniques = table.unique(values)
  File "pandas/_libs/hashtable_class_helper.pxi", line 1767, in pandas._libs.hashtable.PyObjectHashTable.unique
  File "pandas/_libs/hashtable_class_helper.pxi", line 1718, in pandas._libs.hashtable.PyObjectHashTable._unique
TypeError: unhashable type: 'list'


2)
unique_vals = splitted_interests.apply(lambda x: x.unique())

Traceback (most recent call last):
  File "C:/Users/Mark/PycharmProjects/main/main.py", line 68, in <module>
    unique_vals = splitted_interests.apply(lambda x: x.unique())
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\series.py", line 4045, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/lib.pyx", line 2228, in pandas._libs.lib.map_infer
  File "C:/Users/Mark/PycharmProjects/main/main.py", line 68, in <lambda>
    unique_vals = splitted_interests.apply(lambda x: x.unique())
AttributeError: 'list' object has no attribute 'unique'

3)
unique_vals = splitted_interests.apply(lambda x: [y.unique() for y in x])

Traceback (most recent call last):
  File "C:\Users\Mark\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\series.py", line 4045, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/lib.pyx", line 2228, in pandas._libs.lib.map_infer
  File "C:/Users/Mark/PycharmProjects/main/main.py", line 68, in <lambda>
    unique_vals = splitted_interests.apply(lambda x: [y.unique() for y in x])
  File "C:/Users/Mark/PycharmProjects/main/main.py", line 68, in <listcomp>
    unique_vals = splitted_interests.apply(lambda x: [y.unique() for y in x])
AttributeError: 'str' object has no attribute 'unique'

At run time, the column with lists looks like this: enter image description here

3 Answers 3

3

"To retrieve the unique values I have tried three different methods. Each one has failed with a different reason."

you may wanna try astype('str') to retrieve unique values in a column:

df.<column>.astype('str').unique()
Sign up to request clarification or add additional context in comments.

1 Comment

This answer should get more upvotes
2

For same ordering create dictionaries and extract keys, solution working in python 3.6+:

df = pd.DataFrame({'JobRoleInterest':['aa,ss,ss','dd,ff','k,dd,dd,dd', 'j,gg']})
splitted_interests = df['JobRoleInterest'].str.split(',')

unique_vals = list(dict.fromkeys([y for x in splitted_interests for y in x]).keys())
print (unique_vals)
['aa', 'ss', 'dd', 'ff', 'k', 'j', 'gg']

6 Comments

@jezrael, close, but not really what I was after. In your answers above I still see items duplicated. dd appears in 1 and 2. It also appears twice in 2, which for a set I thought was impossible. All I am after is a list of the unique values so I can create column names from it.
No. Just a list of every unique value from all of the lists in the column. I will use that list to create column names. Then I plan to split the one column with list of values into multiple columns (one per unique value) with a single value per one of the techniques in the link I mentioned in my post.
@jezrael [aa, ss, dd, ff, k, j, gg] Single list (or set). One of each value.
@jezrael If other rows had matching values, I don't want\need them. Just a single entry per value anywhere in the entire column.
@jezrael I was at a location where my browser had not refreshed. I had not seen your updated answer. I just ran it and I got what I was after. Thanks!
|
2

I think you need, pd.Series.unique

Using @jezrael data:

df = pd.DataFrame({'JobRoleInterest':['aa,ss,ss','dd,ff','k,dd,dd,dd', 'j,gg']})

df['JobRoleInterest'].str.split(',', expand=True).stack().unique().tolist()

Output:

['aa', 'ss', 'dd', 'ff', 'k', 'j', 'gg']

Update using list data per @MarkS comments below:

df = pd.DataFrame({'JobRoleInterest':[['aa','ss','ss'],['dd','ff'],['k','dd','dd','dd'],['j','gg']]})
df['JobRoleInterest'].explode().unique().tolist()

Output:

['aa', 'ss', 'dd', 'ff', 'k', 'j', 'gg']

2 Comments

@ScottBoston I tried yours as well but I am getting back an empty list. Note the data in my lists in the column are not structured the way @jezrael structured his. Look above at the screen shot. Using @jezrael's data, it would be: ['aa', 'ss', 'ss', 'dd', 'ff', 'k', 'dd', 'dd', 'dd', 'j', 'gg']
@ScottBoston That worked as well, and it's easier (for me, at least) to understand. Much obliged.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.