
I can't figure out where I went wrong.

I have a file containing a DataFrame with a single column; each row holds one list.

I am lost, please advise.

fruits

0   ['apple', 'orange','grape']

1   ['apple','pineapple','coconut']

Expected:

fruit

0   apple

1   coconut

2   grape

3   orange

4   pineapple

  • You just saved half my life, but how do I keep only the unique values? Thanks! Commented Sep 5, 2019 at 16:08
  • Oh, this is already answered here. Let me delete my answer then. Commented Sep 5, 2019 at 16:10
  • I tried both methods but got stuck. I'm pretty sure I did something wrong, but couldn't figure out where. Commented Sep 5, 2019 at 18:59

2 Answers

2

Flatten your data into a single list first, then load it as a column of your DataFrame:

>>> import numpy as np
>>> import pandas as pd
>>> data = [[['apple', 'orange','grape']],[['apple','pineapple','coconut']]]
>>> data = np.unique(np.ravel(data))  # flatten, drop duplicates, sort
>>> df = pd.DataFrame(data, columns = ['fruit'])
>>> df
       fruit
0      apple
1    coconut
2      grape
3     orange
4  pineapple

Edit for new case

Hi Jonathan, I replied to your email about how to handle entries whose column values are strings that only look like lists. You need to use ast.literal_eval() on them.

>>> df = pd.DataFrame({'fruits': ['[\'apple\', \'orange\',\'grape\']','[\'apple\',\'pineapple\',\'coconut\']']})
>>> df
                            fruits
0      ['apple', 'orange','grape']
1  ['apple','pineapple','coconut']

However, you then have to loop through the column and collect every converted list into a dummy_list, so that everything ends up in one list you can process further.

>>> import ast
>>> dummy_list = []
>>> for i in range(0, len(df)):
...     dummy_list.extend(ast.literal_eval(df['fruits'][i]))
...
>>> dummy_list
['apple', 'orange', 'grape', 'apple', 'pineapple', 'coconut']

Get the unique values and create the DataFrame you want (a set has no defined order, so the order of x may differ from the one shown):

>>> x = list(set(dummy_list))
>>> x
['orange', 'apple', 'grape', 'coconut', 'pineapple']
>>> df2 = pd.DataFrame(x, columns = ['fruits 2.0'])
>>> df2
  fruits 2.0
0     orange
1      apple
2      grape
3    coconut
4  pineapple
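
The loop and set steps above can also be sketched in a more compact form. This is only a sketch under assumptions: the sample frame is rebuilt by hand rather than read from a file, the names parsed, flat and unique_fruits are made up for illustration, and dict.fromkeys is swapped in for set so that the first-seen order is kept.

```python
import ast
import pandas as pd

# Hand-built sample: each cell is a string that merely looks like a list
df = pd.DataFrame({'fruits': ["['apple', 'orange','grape']",
                              "['apple','pineapple','coconut']"]})

# Turn every cell into a real Python list
parsed = df['fruits'].apply(ast.literal_eval)

# Flatten the per-row lists into one list
flat = [fruit for row in parsed for fruit in row]

# dict.fromkeys drops duplicates while keeping first-seen order
unique_fruits = list(dict.fromkeys(flat))

df2 = pd.DataFrame(unique_fruits, columns=['fruit'])
print(df2)
```

If the sorted order from the question is wanted instead, sorted(unique_fruits) can be passed to the DataFrame constructor.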

1

np.ravel alone (as Anky proposed) is not enough: you then need to remove the duplicates. And if you are unhappy with a non-contiguous index, you can reset it.

So the complete code can be:

# assumes numpy as np, pandas as pd, and data (the nested list) already exist
df = pd.DataFrame(np.ravel(data), columns=['fruit'])\
    .drop_duplicates().reset_index(drop=True)

np.unique (as in the other answer) has a downside: it sorts the source array. I suppose you want to keep the original order.
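
To make the difference concrete, here is a small side-by-side sketch. It assumes the nested list from the question with inner lists of equal length (as far as I know, np.ravel would not flatten ragged lists this way); the names flat, sorted_unique and ordered_unique are only illustrative.

```python
import numpy as np
import pandas as pd

# Nested list in the shape used in the other answer
data = [[['apple', 'orange', 'grape']], [['apple', 'pineapple', 'coconut']]]

# One flat array with all six entries, duplicates included
flat = np.ravel(data)

# np.unique removes duplicates but returns the values sorted
sorted_unique = np.unique(flat).tolist()

# drop_duplicates keeps the first occurrence, so source order survives
ordered_unique = pd.Series(flat).drop_duplicates().tolist()

print(sorted_unique)
print(ordered_unique)
```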

Edit after your comment

It looks like you actually have a DataFrame, read with read_excel(), that looks like this:

                        fruits
0       [apple, orange, grape]
1  [apple, pineapple, coconut]

(not the list presented in your post).

To convert such a DataFrame to a single, flat list, you can run:

lst = df['fruits'].apply(pd.Series).stack().drop_duplicates().to_list()

The result is an ordinary Python list.

To create a second DataFrame with a single column, run:

df2 = pd.DataFrame(lst, columns=['fruits'])

Another option, without creating an intermediate list:

df['fruits'].apply(pd.Series).stack().rename('fruits')\
    .drop_duplicates().reset_index(drop=True).to_frame()
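
As a possible alternative to apply(pd.Series).stack(), Series.explode (available since pandas 0.25) puts each list element on its own row. This is a different technique from the one above, shown only as a sketch; it assumes the column holds real list objects, and the sample frame is built by hand.

```python
import pandas as pd

# Hand-built sample: each cell holds a real Python list
df = pd.DataFrame({'fruits': [['apple', 'orange', 'grape'],
                              ['apple', 'pineapple', 'coconut']]})

# explode: one row per list element, source row index repeated
df2 = (df['fruits'].explode()
       .drop_duplicates()        # keep the first occurrence of each fruit
       .reset_index(drop=True)   # replace the repeated index with 0..n-1
       .to_frame())              # back to a single-column DataFrame

print(df2)
```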

Edit 2

I found a simpler solution, taking into account that read_excel reads these cells as strings (each cell holds the text of a list, not a list object).

The key to success is the str.extractall method, applied to the fruits column. To extract the text between apostrophes, the regex should be:

'(?P<fruits>[^']+)'

Details:

  • ' - An apostrophe (represents itself), before the text to match.
  • (?P<fruits> - Start of a named capturing group (also named fruits).
  • [^']+ - The content of this group - a non-empty sequence of chars other than an apostrophe.
  • ) - End of the capturing group.
  • ' - Another apostrophe, after the text to match.

So if you run:

df.fruits.str.extractall(r"'(?P<fruits>[^']+)'")

you will get:

            fruits
  match           
0 0          apple
  1         orange
  2          grape
1 0          apple
  1      pineapple
  2        coconut

This result contains:

  • A MultiIndex:
    • top level - the index of the source row (with no name),
    • second level - match number (0, 1 and 2 for each row).
  • fruits - a column named after the capturing group, holding the individual strings in consecutive rows.

All that remains is to drop duplicates and reset the index.

So the complete code, as a single instruction, is:

df.fruits.str.extractall(r"'(?P<fruits>[^']+)'")\
    .drop_duplicates().reset_index(drop=True)

The result is:

      fruits
0      apple
1     orange
2      grape
3  pineapple
4    coconut
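
For anyone who wants to try this without an Excel file, the following self-contained sketch rebuilds the frame by hand, on the assumption that each cell is a string such as "['apple', 'orange','grape']". The names matches and result are only illustrative.

```python
import pandas as pd

# Hand-built stand-in for the frame read from Excel: cells are strings
df = pd.DataFrame({'fruits': ["['apple', 'orange','grape']",
                              "['apple','pineapple','coconut']"]})

# One row per quoted item; the index is (source row, match number)
matches = df['fruits'].str.extractall(r"'(?P<fruits>[^']+)'")

# Drop repeated fruits, then flatten the MultiIndex to 0..n-1
result = matches.drop_duplicates().reset_index(drop=True)

print(result)
```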

9 Comments

I am facing one last problem. The 'data' that I posted is a list, meaning type(data) is list. But when I use pd.read_excel, the data type becomes DataFrame, which is where I am stuck. When I convert it to a list, the columns=['fruit'] is lost, so I am unable to extract the unique values. Please advise.
In your post you didn't mention read_excel. So now it looks like you want to convert a DataFrame to a single column. Run print(df) on what you have read from Excel and add the output to your post.
fruits 0 ['apple', 'orange','grape'] 1 ['apple','pineapple','coconut'] pandas.core.frame.DataFrame
Sorry I didn't mention it at the beginning; I posted the 'data' without knowing the difference. But now I am lost. After [df = pd.read_excel], I followed up with your 'another option' code. The result is: fruits 0 ['apple', 'orange','grape'] 1 ['apple','pineapple','coconut']. Which part did I miss?
fruits should be the name of the (only) column. Maybe in your DataFrame it is the first level of a MultiIndex?