How to solve an error with pivot tables in python

Question

I have a file similar to this one:

    movieId     title   genres  userId  rating  timestamp

0   1   Toy Story (1995)    Adventure|Animation|Children|Comedy|Fantasy     1   4.0     964982703
1   1   Toy Story (1995)    Adventure|Animation|Children|Comedy|Fantasy     5   4.0     847434962
2   1   Toy Story (1995)    Adventure|Animation|Children|Comedy|Fantasy     7   4.5     1106635946
3   1   Toy Story (1995)    Adventure|Animation|Children|Comedy|Fantasy     15  2.5     1510577970
4   1   Toy Story (1995)    Adventure|Animation|Children|Comedy|Fantasy     17  4.5     1305696483
5   6   Heat (1995)     Action|Crime|Thriller   373     5.0     846830247
6   6   Heat (1995)     Action|Crime|Thriller   380     5.0     1494278663
7   6   Heat (1995)     Action|Crime|Thriller   385     3.0     840648313
8   6   Heat (1995)     Action|Crime|Thriller   386     3.0     842613783
9   6   Heat (1995)     Action|Crime|Thriller   389     5.0     857934242

I ran this code to obtain the full data and to process it:

! wget https://www.dropbox.com/s/z4zoofdgdrxe01r/movies.csv
! wget https://www.dropbox.com/s/f328xczt6vju6hi/ratings.csv
import pandas as pd
df_movies = pd.read_csv('movies.csv')
df_ratings = pd.read_csv('ratings.csv')
df_merged=pd.merge(df_movies, df_ratings, how='inner')

This is the code I have issues with:

df_merged.pivot(index='movieId', columns='title', values='rating')

I got:

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-74-ad6b3a589ea8> in <module>()
----> 1 df_merged.pivot(index='movieId', columns='title', values='rating')

5 frames

/usr/local/lib/python3.6/dist-packages/pandas/core/reshape/reshape.py in _make_selectors(self)
    177 
    178         if mask.sum() < len(self.index):
--> 179             raise ValueError("Index contains duplicate entries, cannot reshape")
    180 
    181         self.group_index = comp_index

ValueError: Index contains duplicate entries, cannot reshape

What I want is to know which movie has the most votes, using a summary table like a pivot table in Excel.

Craig · Accepted Answer · 2020-09-26 15:29:53Z

The most straightforward way to count the rows in each group is the DataFrame.value_counts() method, which was introduced in pandas 1.1. For earlier versions of pandas, a similar result can be achieved with the Series.value_counts() method. Other alternatives include DataFrame.groupby() and DataFrame.pivot_table(). These might be preferred if you want to aggregate the data using multiple criteria beyond just counting the number of items.

Setup

import pandas as pd

df_merged = pd.DataFrame({'movieId': [1, 1, 1, 1, 1, 6, 6, 6, 6, 6], 
                  'title': ['Toy Story (1995)', 'Toy Story (1995)', 'Toy Story (1995)','Toy Story (1995)', 'Toy Story (1995)', 'Heat (1995)', 'Heat (1995)', 'Heat (1995)', 'Heat (1995)', 'Heat (1995)'], 
                  'rating': [4.0, 4.0, 4.5, 2.5, 4.5, 5.0, 5.0, 3.0, 3.0, 5.0]})

value_counts()

To get the number of votes, use .value_counts() to count the number of items:

df_merged.value_counts('title')

This will return a new series that has the titles of the movies as the index and the number of ratings on each movie as the values.

Heat (1995)         5
Toy Story (1995)    5
Name: title, dtype: int64

For versions of pandas before 1.1, you can use .value_counts() on a Series to get a similar result:

df_merged['title'].value_counts()
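
Since the question asks which movie has the most votes, here is a minimal sketch of reading that off the counts. The small frame and the names ratings, votes and most_voted are made up for illustration and are not the real ratings file:

```python
import pandas as pd

# Hypothetical sample data: three votes for one title, two for the other.
ratings = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Toy Story (1995)', 'Toy Story (1995)',
              'Heat (1995)', 'Heat (1995)'],
    'rating': [4.0, 4.5, 2.5, 5.0, 3.0],
})

# value_counts() sorts by count in descending order by default,
# so the most-voted title comes first.
votes = ratings['title'].value_counts()

# idxmax() returns the index label (the title) with the largest count.
most_voted = votes.idxmax()
print(most_voted, votes.max())
```

If several titles tie for the top count, idxmax() only returns one of them, so inspecting the head of votes may be safer in that case.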

groupby

Another approach is to use .groupby() with .size():

df_merged.groupby('title').size()
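
As a sketch of the "multiple criteria" case with groupby, the count and the mean rating can be computed in one pass by handing .agg() a list of function names. This reuses the sample data from the Setup section; the variable name summary is just for illustration:

```python
import pandas as pd

# Same sample data as in the Setup section (title and rating columns only).
df_merged = pd.DataFrame({
    'title': ['Toy Story (1995)'] * 5 + ['Heat (1995)'] * 5,
    'rating': [4.0, 4.0, 4.5, 2.5, 4.5, 5.0, 5.0, 3.0, 3.0, 5.0],
})

# Group by title, then compute several aggregates of the rating column at once.
summary = df_merged.groupby('title')['rating'].agg(['count', 'mean'])

# Sort so the titles with the most ratings come first.
summary = summary.sort_values('count', ascending=False)
print(summary)
```

Note that .size() counts every row in a group, while 'count' only counts non-null ratings, so the two can differ if the rating column has missing values.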

pivot_table()

This can also be done using the .pivot_table() method:

df_merged.pivot_table(values='rating', index=['title'], aggfunc='count')

Which produces a DataFrame as output:

               rating
title   
Heat (1995)         5
Toy Story (1995)    5

The pivot_table approach could be useful if you wanted to aggregate using multiple criteria, for example, the number of ratings and the average (mean) rating:

df_merged.pivot_table(values='rating', index=['title'], aggfunc=('count','mean'))

                  count  mean
title                        
Heat (1995)           5   4.2
Toy Story (1995)      5   3.9
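
To illustrate why the original .pivot() call fails, here is a small sketch on made-up data (the frame df and the names pivot_failed and table are hypothetical). .pivot() only reshapes, so it needs each index/column pair to be unique, whereas .pivot_table() aggregates the duplicates:

```python
import pandas as pd

# Hypothetical sample: each movieId/title pair appears more than once.
df = pd.DataFrame({
    'movieId': [1, 1, 6, 6, 6],
    'title': ['Toy Story (1995)', 'Toy Story (1995)',
              'Heat (1995)', 'Heat (1995)', 'Heat (1995)'],
    'rating': [4.0, 4.5, 5.0, 3.0, 5.0],
})

# pivot() cannot place several ratings into a single cell, so it raises.
try:
    df.pivot(index='movieId', columns='title', values='rating')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print('pivot failed:', err)

# pivot_table() reduces the duplicates in each cell with aggfunc instead.
table = df.pivot_table(index='movieId', columns='title',
                       values='rating', aggfunc='count')
print(table)
```

Cells for combinations that never occur (for example movieId 1 with 'Heat (1995)') come out as NaN unless fill_value is passed.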

edited Sep 26, 2020 at 15:29

answered Sep 25, 2020 at 2:53

Craig


8 Comments

Another.Chemist Over a year ago

I got AttributeError: 'DataFrame' object has no attribute 'value_counts'

Craig Over a year ago

This works for pandas >= 1.1. Are you using an older version of pandas?

Craig Over a year ago

I just opened a new notebook in colab and ran: import pandas; pandas.__version__ and got 1.0.5. I'll update my answer to work for this version.

Craig Over a year ago

The third was just a typo on my part. I had 'titles' but your column is 'title'. It's fixed now.

Craig Over a year ago

The DataFrame.pivot() command just re-shapes a DataFrame, it doesn't aggregate any of the data. I've updated my answer to show how this can be done using DataFrame.pivot_table(), which is similar to Pivot Tables in Excel and may be what you were trying to do with pivot. If this isn't the output that you want, please update your question with a clear example of what you expect the output to look like.
