Finding count of duplicate values and ordering in a Pandas dataframe

Question

I have a Pandas dataframe with many columns, two of which are "movie title" and "age". I want to find the 5 movies with the lowest average age of the people who rated them, but only include movies that have at least 100 ratings (that is, movies that appear in at least 100 rows).

For example:

movie title      age

Title 1          10
Title 2          12
Title 2          12
Title 3          13
Title 3          13
Title 3          13

Should become:

movie title     # of ratings     avg age

Title 1         1                10
Title 2         2                12
Title 3         3                13

It can be in the same or a new dataframe. Thanks for your help!

Ami Tavory · Accepted Answer · 2016-04-02 20:19:28Z

4

Say you do

agg = df.groupby('movie title')['age'].agg(ave_age='mean', size='size')

You'll get a DataFrame with columns ave_age and size.

agg[agg['size'] >= 100]

will give you only those movies that have at least 100 ratings. From there, sort by ave_age and take the top 5. It should look something like this:

agg[agg['size'] >= 100].sort_values(by='ave_age', ascending=True).head(5)
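
For reference, here is a self-contained sketch of these steps, run on the sample data from the question. The names `stats`, `min_ratings` and `top5` are my own, and the threshold of 2 is an assumption chosen only so that the tiny sample keeps some rows; on the real data it would be 100. It uses keyword-style (named) aggregation, which needs pandas 0.25 or later.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "movie title": ["Title 1", "Title 2", "Title 2",
                    "Title 3", "Title 3", "Title 3"],
    "age": [10, 12, 12, 13, 13, 13],
})

min_ratings = 2  # hypothetical threshold for the sample; use 100 on real data

# One row per movie: mean age and number of ratings.
stats = df.groupby("movie title")["age"].agg(ave_age="mean", size="size")

# Keep movies with enough ratings, then take the 5 lowest average ages.
top5 = (stats[stats["size"] >= min_ratings]
        .sort_values(by="ave_age")
        .head(5))
print(top5)
```

With the sample data only Title 2 and Title 3 survive the threshold, so fewer than five rows come back.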

edited Apr 2, 2016 at 20:19

answered Apr 2, 2016 at 19:55


9 Comments

Alexander Over a year ago

Have a look at the groupby. I believe df.title throws an error and is not needed anyway. Also, size is a dataframe property, so you may want to use a different variable name. ascending=True is the default parameter value so is not required, but it doesn't hurt to be explicit either.
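
The point about size can be seen in a small sketch. The frame below is made up for illustration (its values are not from the real data); it only shows that attribute access reaches the built-in DataFrame.size property rather than a column of the same name.

```python
import pandas as pd

# Hypothetical aggregated frame with a column that happens to be called "size".
stats = pd.DataFrame(
    {"ave_age": [12.0, 13.0], "size": [2, 3]},
    index=["Title 2", "Title 3"],
)

# Attribute access returns the number of cells (2 rows x 2 columns).
print(stats.size)               # 4

# Bracket access is needed to reach the column itself.
print(stats["size"].tolist())   # [2, 3]
```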

Ami Tavory Over a year ago

Thanks, @Alexander you're right - corrected. I believe that pd.read_clipboard() does strange stuff on this particular example. Consequently, I made some modifications, and apparently got one wrong.

Alexander Over a year ago

I believe you want agg = df.groupby('movie title').age.agg(...)

Ami Tavory Over a year ago

Thanks, guys. @Alexander, unfortunately, I could only upvote your answer once.

MaxU - stand with Ukraine Over a year ago

I do like both solutions, so I wanted to compare how fast they are. Ami's: 100 loops, best of 3: 6.56 ms per loop. Alexander's: 100 loops, best of 3: 16.9 ms per loop.


Alexander · Accepted Answer · 2016-04-06 03:51:30Z

3

The filter creates a Boolean flag for each row that is True if that row's movie title appears at least n times (here, one hundred) and False otherwise.

n = 100
# One Boolean per row: True if the row's movie has at least n non-null ages.
# Named mask rather than filter to avoid shadowing the Python builtin.
mask = (df.groupby('movie title')['age']
          .transform('count') >= n)

Given the small size of your sample data, I will set n to be 2 when creating the filter mask.

Now I just keep the rows for movies with at least n ratings, calculate the average age per group, and then take the five smallest (i.e. lowest average age).

>>> df[mask].groupby('movie title').age.mean().nsmallest(5)
movie title
Title 2    12
Title 3    13
Name: age, dtype: int64
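
To make the per-row flag concrete, here is a minimal sketch on the question's sample data with n set to 2. The variable names `counts`, `mask` and `result` are my own; the key point is that transform returns one value per original row, so the comparison lines up with the rows of df.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "movie title": ["Title 1", "Title 2", "Title 2",
                    "Title 3", "Title 3", "Title 3"],
    "age": [10, 12, 12, 13, 13, 13],
})

n = 2  # small threshold for the sample; the question asks for 100

# Each row receives the rating count of its own movie.
counts = df.groupby("movie title")["age"].transform("count")
mask = counts >= n
print(mask.tolist())  # [False, True, True, True, True, True]

# Rows for Title 1 are dropped before averaging.
result = df[mask].groupby("movie title")["age"].mean().nsmallest(5)
print(result)
```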

edited Apr 6, 2016 at 3:51

answered Apr 2, 2016 at 19:58


3 Comments

ℕʘʘḆḽḘ Over a year ago

Couldn't you use the 'filter' method after groupby directly?
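
A sketch of what that suggestion might look like, assuming the same sample data and a threshold of 2 (the names `kept` and `result` are hypothetical). This is the groupby filter method, not the approach from the answer: it is given each group and keeps all rows of the groups for which the function returns True.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "movie title": ["Title 1", "Title 2", "Title 2",
                    "Title 3", "Title 3", "Title 3"],
    "age": [10, 12, 12, 13, 13, 13],
})

n = 2  # small threshold for the sample; the question asks for 100

# Keep every row belonging to a movie with at least n rows.
kept = df.groupby("movie title").filter(lambda g: len(g) >= n)

# Then average per movie and take the five lowest.
result = kept.groupby("movie title")["age"].mean().nsmallest(5)
print(result)
```

Note that len(g) counts rows, including any with a missing age, whereas a count on the age column would skip them.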

user2453297 Over a year ago

This doesn't seem to filter out movies with less than 100 ratings when I run it?

Alexander Over a year ago

It works if the dataframe only has the columns movie title and age. See edit above for fix.
