
So I have a dataframe that has a column like the following:

Fruit
apple;banana
pear;apple;peach
blueberry;durian;apple;peach
banana;grape;orange
.

and so on. I want to create an end list where I can get the following list:

fruitList = ['apple','banana','pear','apple','peach','blueberry','durian','apple','peach','banana','grape','orange']

How would I do this? I managed to do this for a single row like the following:

 fruitList.extend(df['Fruit'].iloc[0].split(';'))
 #fruitList = ['apple','banana']

But of course, that only works for one row... how do I generalize this? My plan is just to count the fruit and get the top 10 fruit counts. My end goal is just to keep those rows that include a top 10 fruit... but to get there, how would I come up with fruitList in the first place?

  • iloc[0] refers to the first row. Using a for loop, you can generalize this. Can you add more data? Commented Nov 11, 2017 at 23:25
  • @sera I guess I could do this with a loop over every single dataframe row, but with a very large dataframe wouldn't this be slow? I was just wondering if there was a built-in way to do this in pandas, if that makes sense. And yes, I can add more data examples. Commented Nov 11, 2017 at 23:27
  • @sera In Python we avoid loops as much as possible. Always search for a vectorized way of doing things. Dive into Stack Overflow looking for problems like yours, or post a question about it. Commented Nov 11, 2017 at 23:38
  • I see I was lazy and didn't read the entire question. Good work @sera. Commented Nov 11, 2017 at 23:50
  • @srodriguex My answer was an addition. Good work too. Commented Nov 11, 2017 at 23:53

2 Answers

df.Fruit.str.split(';').sum()

See full code in Microsoft Azure Notebook.
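The one-liner above works because `sum()` on a Series of lists concatenates them with `+`. That builds a new list at every step, so it may get slow on a large dataframe. Here is a minimal sketch of an alternative that flattens the lists in a single pass with `itertools.chain`. The sample frame is copied from the question, the name `fruit_list` is only illustrative, and the sketch assumes the column has no missing values.

```python
from itertools import chain

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"Fruit": [
    "apple;banana",
    "pear;apple;peach",
    "blueberry;durian;apple;peach",
    "banana;grape;orange",
]})

# Each cell becomes a list of fruit names.
split_fruits = df["Fruit"].str.split(";")

# Flatten the per-row lists into one list, keeping duplicates and row order.
# Assumes there are no NaN cells (a NaN is a float and cannot be iterated).
fruit_list = list(chain.from_iterable(split_fruits))
```

On pandas 0.25 or later, `df["Fruit"].str.split(";").explode().tolist()` is another option that stays inside pandas.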


1 Comment

Didn't realize I could use sum() on lists like this thanks :)

In addition to srodriguex's answer:

from collections import Counter

# flatten the per-row lists into a single list of fruit names
all_fruits = df.Fruit.str.split(';').sum()
c = Counter(all_fruits)
c.most_common(3)

Now if you want to get the rows:

df[df['Fruit'].str.contains("peach")]

and to get the indices:

list(df[df['Fruit'].str.contains("peach")].index)

Results

[('apple', 3), ('peach', 2), ('pear', 1)]


                         Fruit
1              pear;apple;peach
2  blueberry;durian;apple;peach


[1, 2]
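To connect this back to the end goal in the question (keep only the rows that include a top-N fruit), the counting and filtering steps can be combined. What follows is a rough sketch under stated assumptions, not a definitive implementation: `TOP_N`, `top_fruits`, `mask` and `filtered` are hypothetical names, and `TOP_N` is set to 1 only because the sample data is tiny (the question asks for 10). It tests membership on the split lists instead of using `str.contains`, since a substring match on "apple" would also match a name such as "pineapple".

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"Fruit": [
    "apple;banana",
    "pear;apple;peach",
    "blueberry;durian;apple;peach",
    "banana;grape;orange",
]})

TOP_N = 1  # hypothetical value for this small sample; the question wants 10

# Split once and reuse the result for both counting and filtering.
split_fruits = df["Fruit"].str.split(";")

# Count every fruit across all rows and keep the names of the top N.
counts = Counter(chain.from_iterable(split_fruits))
top_fruits = {fruit for fruit, _ in counts.most_common(TOP_N)}

# A row is kept if at least one of its fruits is in the top-N set.
mask = split_fruits.apply(lambda fruits: not top_fruits.isdisjoint(fruits))
filtered = df[mask]
```

With the sample data, "apple" is the single most common fruit, so `filtered` holds the three rows that contain it.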

2 Comments

@ocean800 I just modified my answer. See how you can get the rows.
@ocean800 Glad that I helped. See also my last modification: you can get the indices of the rows.
