How to merge a column of lists, extract the unique string values, and put them into a DataFrame

Question

I'm going crazy; I can't figure out where I went wrong.

I have a file containing a DataFrame with a single column, where each row holds one list.

I am lost, please advise.

fruits

0   ['apple', 'orange','grape']

1   ['apple','pineapple','coconut']

Expected:

fruit

0   apple

1   coconut

2   grape

3   orange

4   pineapple

link

You just saved half of my life, but how do I keep only the unique values? Thanks!
– Jonathan, Commented Sep 5, 2019 at 16:08
Oh, this is already answered here. Lol, let me delete my answer then.
– Joe, Commented Sep 5, 2019 at 16:10
I tried both methods but got stuck. I'm pretty sure I did something wrong, but I can't figure out where. link
– Jonathan, Commented Sep 5, 2019 at 18:59

Joe · Accepted Answer · 2019-09-06 09:06:36Z

Flatten your data into a single list first, then read it in as a column of your DataFrame:

>>> import numpy as np
>>> import pandas as pd
>>> data = [[['apple', 'orange','grape']],[['apple','pineapple','coconut']]]
>>> data = np.unique(np.ravel(data))
>>> df = pd.DataFrame(data, columns = ['fruit'])
>>> df
       fruit
0      apple
1    coconut
2      grape
3     orange
4  pineapple
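np.ravel flattens cleanly here because every inner list has the same length. If the rows can have different lengths, a sketch along the following lines may be more robust; it swaps in itertools.chain.from_iterable for np.ravel, and the sample rows are made up for illustration:

```python
from itertools import chain

import pandas as pd

# Hypothetical sample: the second row is shorter than the first.
rows = [['apple', 'orange', 'grape'], ['apple', 'pineapple']]

# Concatenate the inner lists into one flat list, whatever their lengths.
flat = list(chain.from_iterable(rows))

# set() removes duplicates; sorted() gives a stable, alphabetical order.
df = pd.DataFrame(sorted(set(flat)), columns=['fruit'])
print(df)
```

The sorted(set(...)) step mirrors what np.unique does in the answer above: duplicates are dropped and the result is sorted alphabetically.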

Edit for new case

Hi Jonathan, I replied to your email about how to handle the entries when your column values only look like lists (they are actually strings). You need to use ast.literal_eval() on them.

>>> df = pd.DataFrame({'fruits': ['[\'apple\', \'orange\',\'grape\']','[\'apple\',\'pineapple\',\'coconut\']']})
>>> df
                            fruits
0      ['apple', 'orange','grape']
1  ['apple','pineapple','coconut']

To do so, however, you have to loop through the column and add every converted list to a dummy_list, gathering everything into one list that you can then process as you like.

>>> import ast
>>> dummy_list = []
>>> for i in range(0, len(df)):
...     dummy_list.extend(ast.literal_eval(df['fruits'][i]))
...
>>> dummy_list
['apple', 'orange', 'grape', 'apple', 'pineapple', 'coconut']

Getting the unique values and creating the DataFrame you want:

>>> x = list(set(dummy_list))
>>> x
['orange', 'apple', 'grape', 'coconut', 'pineapple']
>>> df2 = pd.DataFrame(x, columns = ['fruits 2.0'])
>>> df2
  fruits 2.0
0     orange
1      apple
2      grape
3    coconut
4  pineapple
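Note that a set is unordered, so the row order shown above is not guaranteed. A more compact variant of the loop, which also keeps the fruits in first-seen order, might look like the sketch below. It assumes pandas 0.25 or later (where Series.explode is available); the column name fruit and the variable names are chosen for illustration:

```python
import ast

import pandas as pd

# Each cell is a string that only looks like a list.
df = pd.DataFrame({'fruits': ["['apple', 'orange','grape']",
                              "['apple','pineapple','coconut']"]})

# Turn every string into a real Python list.
parsed = df['fruits'].apply(ast.literal_eval)

# One row per list element, then keep the first occurrence of each value.
unique_fruits = (parsed.explode()
                       .drop_duplicates()
                       .reset_index(drop=True)
                       .to_frame('fruit'))
print(unique_fruits)
```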

Valdi_Bo · Accepted Answer · 2019-09-09 13:17:18Z

np.ravel alone (as Anky proposed) is not enough: you then need to remove duplicates. And if you are unhappy about the non-contiguous index, you are free to reset it.

So the complete code can be:

df = pd.DataFrame(np.ravel(data),columns=['fruit'])\
    .drop_duplicates().reset_index(drop=True)

np.unique (as in the other answer) has the downside that it sorts the source array. I suppose you want to keep the original order.
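The difference in ordering can be seen in a small side-by-side sketch (the flat list here is just the sample data from the question, already flattened):

```python
import numpy as np
import pandas as pd

flat = ['apple', 'orange', 'grape', 'apple', 'pineapple', 'coconut']

# np.unique returns the distinct values in sorted order.
sorted_unique = np.unique(flat).tolist()

# drop_duplicates keeps the first occurrence of each value, in input order.
first_seen = pd.Series(flat).drop_duplicates().tolist()

print(sorted_unique)
print(first_seen)
```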

Edit after your comment

It looks like you actually had a DataFrame, read using read_excel(), looking like below:

                        fruits
0       [apple, orange, grape]
1  [apple, pineapple, coconut]

(not the list presented in your post).
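If it is unclear whether the cells hold real lists or strings that merely look like lists, inspecting the Python type of each cell should settle it. A minimal sketch; the helper name cell_types and the two sample frames are made up for illustration:

```python
import pandas as pd


def cell_types(series):
    """Return the set of Python type names found in a column."""
    return set(series.map(lambda value: type(value).__name__))


# Cells are real Python lists.
df_lists = pd.DataFrame({'fruits': [['apple', 'orange'], ['apple', 'coconut']]})

# Cells are strings that only look like lists.
df_strings = pd.DataFrame({'fruits': ["['apple', 'orange']", "['apple','coconut']"]})

print(cell_types(df_lists['fruits']))
print(cell_types(df_strings['fruits']))
```

As a rough visual hint, print(df) tends to show real lists without quotes around the items (as in the printout above), while string cells keep their quotes.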

To convert such a DataFrame to a single, flat list, you can run:

lst = df['fruits'].apply(pd.Series).stack().drop_duplicates().to_list()

The result is an ordinary Python list.

To create a second DataFrame with a single column, run:

df2 = pd.DataFrame(lst, columns=['fruits'])

Another option, without creating an intermediate list:

df['fruits'].apply(pd.Series).stack().rename('fruits')\
    .drop_duplicates().reset_index(drop=True).to_frame()

Edit 2

I found a simpler solution, taking into account that read_excel reads a cell containing text such as ['apple', 'orange','grape'] as a plain string, not as a list.

The key to success is the str.extractall method, applied to the fruits column. To extract the text between apostrophes, the regex should be:

'(?P<fruits>[^']+)'

Details:

' - An apostrophe (represents itself), before the text to match.
(?P<fruits> - Start of a named capturing group (also named fruits).
[^']+ - The content of this group - a non-empty sequence of chars other than an apostrophe.
) - End of the capturing group.
' - Another apostrophe, after the text to match.
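To see what the pattern captures before applying it to a whole column, it can be tried on a single cell with the standard re module. This is only a sketch; the sample cell is taken from the question, and the pattern assumes the items are wrapped in single quotes:

```python
import re

pattern = r"'(?P<fruits>[^']+)'"

# One cell as read from the file: a string, not a list.
cell = "['apple', 'orange','grape']"

# With a single capturing group, findall returns the captured text of each match.
matches = re.findall(pattern, cell)
print(matches)
```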

So if you run:

df.fruits.str.extractall(r"'(?P<fruits>[^']+)'")

you will get:

            fruits
  match           
0 0          apple
  1         orange
  2          grape
1 0          apple
  1      pineapple
  2        coconut

This result contains:

- A MultiIndex:
  - top level - the index of the source row (with no name),
  - second level - match number (0, 1 and 2 for each row).
- fruits - a column named after the capturing group, with the individual strings in consecutive rows.

Now it only remains to drop duplicates and reset the index.

So the complete code, as a single instruction, is:

df.fruits.str.extractall(r"'(?P<fruits>[^']+)'")\
    .drop_duplicates().reset_index(drop=True)

The result is:

      fruits
0      apple
1     orange
2      grape
3  pineapple
4    coconut

edited Sep 9, 2019 at 13:17

answered Sep 5, 2019 at 16:40


9 Comments

Jonathan Over a year ago

I'm facing one last problem. The 'data' that I posted is a list, meaning type(data) is list. But when I use pd.read_excel, the data type becomes DataFrame, and that is where I am stuck. When I convert it to a list, the columns=['fruit'] part is lost, so I can't extract the unique values. Please advise.

Valdi_Bo Over a year ago

In your post you didn't mention read_excel. So now it looks like you want to convert a DataFrame to a single column. Run print(df) on what you have read from Excel and add the output to your post.

Jonathan Over a year ago

fruits
0 ['apple', 'orange','grape']
1 ['apple','pineapple','coconut']
pandas.core.frame.DataFrame

Jonathan Over a year ago

Sorry I didn't mention it in the beginning; I posted the 'data' without knowing the difference.
But now I am lost. After df = pd.read_excel, I followed up with your 'another option' code. The result is:
fruits
0 ['apple', 'orange','grape']
1 ['apple','pineapple','coconut']
Which part did I miss?

Valdi_Bo Over a year ago

fruits should be the name of (the only) column. Maybe in your DataFrame it is the first level of MultiIndex?
