I have a file similar to this one:

   movieId             title                                       genres  userId  rating   timestamp
0        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy       1     4.0   964982703
1        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy       5     4.0   847434962
2        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy       7     4.5  1106635946
3        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy      15     2.5  1510577970
4        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy      17     4.5  1305696483
5        6       Heat (1995)                        Action|Crime|Thriller     373     5.0   846830247
6        6       Heat (1995)                        Action|Crime|Thriller     380     5.0  1494278663
7        6       Heat (1995)                        Action|Crime|Thriller     385     3.0   840648313
8        6       Heat (1995)                        Action|Crime|Thriller     386     3.0   842613783
9        6       Heat (1995)                        Action|Crime|Thriller     389     5.0   857934242

I ran this code to obtain the full data and to process it:

! wget https://www.dropbox.com/s/z4zoofdgdrxe01r/movies.csv
! wget https://www.dropbox.com/s/f328xczt6vju6hi/ratings.csv
import pandas as pd
df_movies = pd.read_csv('movies.csv')
df_ratings = pd.read_csv('ratings.csv')
df_merged = pd.merge(df_movies, df_ratings, how='inner')

This is the code I have issues with:

df_merged.pivot(index='movieId', columns='title', values='rating')

I got:

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-74-ad6b3a589ea8> in <module>()
----> 1 df_merged.pivot(index='movieId', columns='title', values='rating')

5 frames

/usr/local/lib/python3.6/dist-packages/pandas/core/reshape/reshape.py in _make_selectors(self)
    177 
    178         if mask.sum() < len(self.index):
--> 179             raise ValueError("Index contains duplicate entries, cannot reshape")
    180 
    181         self.group_index = comp_index

ValueError: Index contains duplicate entries, cannot reshape

What I want is to find out which movie has the most votes by building a summary table, like a pivot table in Excel.

1 Answer

The most straightforward way to count the rows in each group is the DataFrame.value_counts() method, introduced in pandas 1.1. For earlier versions of pandas, a similar result can be achieved with Series.value_counts(). Other alternatives include DataFrame.groupby() and DataFrame.pivot_table(), which may be preferable if you want to aggregate the data in several ways beyond counting the number of items.

Setup

import pandas as pd

# Small sample with the same column names as the data in the question
df_merged = pd.DataFrame({
    'movieId': [1, 1, 1, 1, 1, 6, 6, 6, 6, 6],
    'title': ['Toy Story (1995)'] * 5 + ['Heat (1995)'] * 5,
    'rating': [4.0, 4.0, 4.5, 2.5, 4.5, 5.0, 5.0, 3.0, 3.0, 5.0],
})

value_counts()

To get the number of votes, use .value_counts() to count the number of items:

df_merged.value_counts('title')

This will return a new series that has the titles of the movies as the index and the number of ratings on each movie as the values.

Heat (1995)         5
Toy Story (1995)    5
Name: title, dtype: int64

For versions of pandas before 1.1, you can use .value_counts() on a Series to get a similar result:

df_merged['title'].value_counts()
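To answer the original question directly (which movie has the most votes), the counts can be reduced to a single title. The following is only a minimal sketch: the `ratings` DataFrame is made-up sample data with an uneven number of votes per title so that there is a clear winner, and it uses Series.value_counts() so it should also run on pandas versions before 1.1.

```python
import pandas as pd

# Hypothetical sample data: three votes for Toy Story, two for Heat.
ratings = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Heat (1995)', 'Toy Story (1995)',
              'Heat (1995)', 'Toy Story (1995)'],
    'rating': [4.0, 5.0, 4.5, 3.0, 2.5],
})

# Series.value_counts() sorts by count in descending order by default.
vote_counts = ratings['title'].value_counts()

# idxmax() returns the index label (here, the title) with the highest count.
most_voted = vote_counts.idxmax()
print(most_voted, vote_counts.max())  # Toy Story (1995) 3
```

If several titles tie for first place, idxmax() returns only the first of them; `vote_counts[vote_counts == vote_counts.max()]` would list all of them.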

groupby()

Another approach is to use .groupby() with .size():

df_merged.groupby('title').size()
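groupby() can also compute several statistics at once. As a rough sketch, assuming made-up sample data in a DataFrame named `ratings` and the illustrative output column names `votes` and `mean_rating`, named aggregation (available since pandas 0.25) could be used like this:

```python
import pandas as pd

# Hypothetical sample data, not the real ratings file.
ratings = pd.DataFrame({
    'title': ['Toy Story (1995)'] * 3 + ['Heat (1995)'] * 2,
    'rating': [4.0, 4.5, 2.5, 5.0, 3.0],
})

# Named aggregation: each keyword becomes an output column, defined as
# (input column, aggregation function).
summary = ratings.groupby('title').agg(
    votes=('rating', 'count'),
    mean_rating=('rating', 'mean'),
)

# Put the titles with the most votes first.
summary = summary.sort_values('votes', ascending=False)
print(summary)
```

Sorting by `votes` puts the most-voted movie in the first row, which is close to what a sorted pivot table in Excel would show.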

pivot_table()

This can also be done using the .pivot_table() method:

df_merged.pivot_table(values='rating', index=['title'], aggfunc='count')

Which produces a DataFrame as output:

               rating
title   
Heat (1995)         5
Toy Story (1995)    5

The pivot_table approach could be useful if you want to aggregate using multiple criteria, for example the number of ratings and the average (mean) rating:

df_merged.pivot_table(values='rating', index=['title'], aggfunc=('count','mean'))

                  count  mean
title                        
Heat (1995)           5   4.2
Toy Story (1995)      5   3.9
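As for the error in the question: DataFrame.pivot() only reshapes the data and does not aggregate, so it fails when several rows map to the same index/column cell, whereas pivot_table() combines them with an aggregation function. A small sketch of the difference, using hypothetical sample data in a DataFrame named `ratings`:

```python
import pandas as pd

# Hypothetical sample data with a repeated (movieId, title) pair.
ratings = pd.DataFrame({
    'movieId': [1, 1, 6],
    'title': ['Toy Story (1995)', 'Toy Story (1995)', 'Heat (1995)'],
    'rating': [4.0, 4.5, 5.0],
})

# pivot() only reshapes, so two ratings competing for the same cell raise
# a ValueError about duplicate entries.
try:
    ratings.pivot(index='movieId', columns='title', values='rating')
except ValueError as err:
    error_message = str(err)
    print(error_message)

# pivot_table() aggregates the duplicates into one value per cell.
counts = ratings.pivot_table(index='movieId', columns='title',
                             values='rating', aggfunc='count')
print(counts)
```

Cells with no matching rows (for example movieId 6 under the Toy Story column) are left as NaN unless a fill_value is passed.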

8 Comments

I got AttributeError: 'DataFrame' object has no attribute 'value_counts'
This works for pandas >= 1.1. Are you using an older version of pandas?
I just opened a new notebook in colab and ran: import pandas; pandas.__version__ and got 1.0.5. I'll update my answer to work for this version.
The third was just a typo on my part. I had 'titles' but your column is 'title'. It's fixed now.
The DataFrame.pivot() command just re-shapes a DataFrame, it doesn't aggregate any of the data. I've updated my answer to show how this can be done using DataFrame.pivot_table(), which is similar to Pivot Tables in Excel and may be what you were trying to do with pivot. If this isn't the output that you want, please update your question with a clear example of what you expect the output to look like.