
I have a Pandas DataFrame with many columns, two of which are "movie title" and "age". I want to find the 5 movies with the lowest average age of the people who rated them, counting only movies that have at least 100 ratings (that is, movies that appear in at least 100 rows).

For example:

movie title      age

Title 1          10
Title 2          12
Title 2          12
Title 3          13
Title 3          13
Title 3          13

Should become:

movie title     # of ratings     avg age

Title 1         1                    10
Title 2         2                    12
Title 3         3                    13

It can be in the same or a new dataframe. Thanks for your help!

2 Answers


Say you do

agg = df.groupby('movie title').age.agg(ave_age='mean', size='size')

You'll get a DataFrame with columns ave_age and size.

agg[agg['size'] >= 100]

will give you only those movies that have at least 100 ratings. From there, sort by ave_age and take the first 5. It should look something like this:

agg[agg['size'] >= 100].sort_values(by='ave_age', ascending=True).head(5)
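
For anyone who wants to try this end to end, here is a minimal sketch on the sample data from the question. The column names avg_age and n_ratings and the variable min_ratings are my own choices for illustration, and the threshold is lowered to 2 only because the sample is tiny; with the real data it would be 100.

```python
import pandas as pd

# Sample data from the question; the real frame has more columns and rows.
df = pd.DataFrame({
    "movie title": ["Title 1", "Title 2", "Title 2",
                    "Title 3", "Title 3", "Title 3"],
    "age": [10, 12, 12, 13, 13, 13],
})

min_ratings = 2  # illustrative stand-in for 100 on this tiny sample

# One row per title: mean age and number of ratings (named aggregation).
stats = df.groupby("movie title")["age"].agg(avg_age="mean", n_ratings="size")

# Keep titles with enough ratings, then take the 5 lowest average ages.
top = stats[stats["n_ratings"] >= min_ratings].nsmallest(5, "avg_age")
print(top)
```

Using nsmallest here is just an alternative to sort_values(...).head(5); either should give the same rows on this data.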

9 Comments

Have a look at the groupby. I believe df.title throws an error and is not needed anyway. Also, size is a dataframe property, so you may want to use a different variable name. ascending=True is the default parameter value so is not required, but it doesn't hurt to be explicit either.
Thanks, @Alexander you're right - corrected. I believe that pd.read_clipboard() does strange stuff on this particular example. Consequently, I made some modifications, and apparently got one wrong.
I believe you want agg = df.groupby('movie title').age.agg(...)
Thanks, guys. @Alexander, unfortunately, I could only upvote your answer once.
I do like both solutions, so I wanted to compare how fast they are. Ami's: 100 loops, best of 3: 6.56 ms per loop. Alexander's: 100 loops, best of 3: 16.9 ms per loop.

The filter creates a flag for each row that is True if that row's movie title appears at least n times (here n = 100) and False otherwise.

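n = 100
# Boolean Series aligned with df's rows: True where the row's title has at least n ratings.
# Note that the name `filter` shadows Python's built-in of the same name.
filter = (df.groupby(['movie title'])['age']
          .transform('count') >= n)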

Given the small size of your sample data, I will set n to 2 when creating the filter for the output below.

Now I just keep the rows for movies with at least n ratings, calculate the average age per group, and then take the five smallest (i.e. lowest average age).

>>> df[filter.values].groupby('movie title').age.mean().nsmallest(5)
movie title
Title 2    12
Title 3    13
Name: age, dtype: int64
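
Here is a self-contained sketch of the same idea, in case it helps to run it in one go. The extra "rating" column is hypothetical and only there to show that selecting the age column before transform keeps the mask a single Series even when the frame has other columns. The names counts and mask are mine.

```python
import pandas as pd

df = pd.DataFrame({
    "movie title": ["Title 1", "Title 2", "Title 2",
                    "Title 3", "Title 3", "Title 3"],
    "age": [10, 12, 12, 13, 13, 13],
    "rating": [3, 4, 5, 2, 4, 5],  # hypothetical extra column
})

n = 2  # small threshold for the sample data; use 100 on the real data

# For every row, the number of non-null ages in that row's title group.
counts = df.groupby("movie title")["age"].transform("count")
mask = counts >= n  # boolean Series aligned with df's index

# Average age per remaining title, then the five lowest averages.
result = df[mask].groupby("movie title")["age"].mean().nsmallest(5)
print(result)
```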

3 Comments

Couldn't you use the 'filter' method after groupby directly?
This doesn't seem to filter out movies with fewer than 100 ratings when I run it?
It works if the dataframe only has the columns movie title and age. See edit above for fix.
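
On the question of using the groupby filter method directly: a rough sketch of how that might look is below. It assumes the same sample data as in the question, and the name kept is just for illustration. DataFrameGroupBy.filter keeps all rows of every group for which the function returns True.

```python
import pandas as pd

df = pd.DataFrame({
    "movie title": ["Title 1", "Title 2", "Title 2",
                    "Title 3", "Title 3", "Title 3"],
    "age": [10, 12, 12, 13, 13, 13],
})

n = 2  # small threshold for the sample data; use 100 on the real data

# Drop every title whose group has fewer than n rows.
kept = df.groupby("movie title").filter(lambda g: len(g) >= n)

# Average age per remaining title, then the five lowest averages.
result = kept.groupby("movie title")["age"].mean().nsmallest(5)
print(result)
```

This calls a Python function once per group, so it may be slower than the other approaches on a frame with many distinct titles; I have not timed it.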
