I am trying to figure out how I could plot this data:

column 1 ['genres']: These are the value counts for all the genres in the table

Drama              2453
Comedy             2319
Action             1590
Horror              915
Adventure           586
Thriller            491
Documentary         432
Animation           403
Crime               380
Fantasy             272
Science Fiction     214
Romance             186
Family              144
Mystery             125
Music               100
TV Movie             78
War                  59
History              44
Western              42
Foreign               9
Name: genres, dtype: int64

column 2 ['release_year']: These are the value counts for all the release years across the different genres

2014    699
2013    656
2015    627
2012    584
2011    540
2009    531
2008    495
2010    487
2007    438
2006    408
2005    363
2004    307
2003    281
2002    266
2001    241
2000    226
1999    224
1998    210
1996    203
1997    192
1994    184
1993    178
1995    174
1988    145
1989    136
1992    133
1991    133
1990    132
1987    125
1986    121
1985    109
1984    105
1981     82
1982     81
1983     80
1980     78
1978     65
1979     57
1977     57
1971     55
1973     55
1976     47
1974     46
1966     46
1975     44
1964     42
1970     40
1967     40
1972     40
1968     39
1965     35
1963     34
1962     32
1960     32
1969     31
1961     31
Name: release_year, dtype: int64

I need to answer questions like: What genre is most popular from year to year?

What kinds of plots can be used, and what is the best way to do this, since there would be a lot of bins in a single chart?

Is seaborn better for plotting such variables?

Should I split the year data into two groups (1900s and 2000s)?

Sample of the table:

    id      popularity  runtime  genres     vote_count  vote_average  release_year
0   135397  32.985763   124      Action     5562        6.5           2015
1   76341   28.419936   120      Action     6185        7.1           1995
2   262500  13.112507   119      Adventure  2480        6.3           2015
3   140607  11.173104   136      Thriller   5292        7.5           2013
4   168259  9.335014    137      Action     2947        7.3           2005
  • Why not group your table by year and then count the genres? You should show us a sample of the original table. Commented Mar 8, 2020 at 8:00
  • Sample added in the question Commented Mar 8, 2020 at 8:11
  • If you have a lot of genres maybe a lineplot is the way to go. The top lines will be the most popular genre. Just make sure to have a clear legend and use very distinct colours. Commented Mar 8, 2020 at 9:02
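
The grouping suggested in the comments above could be sketched roughly as follows. This is a minimal example on a tiny hand-made frame, not the asker's real data; the column names `genres` and `release_year` are taken from the sample table, and "popular" is assumed here to mean "most films released", which is only one possible reading (the `popularity` column could be aggregated instead).

```python
import pandas as pd

# Tiny made-up sample; only the column names come from the question.
df = pd.DataFrame({
    "genres": ["Action", "Action", "Drama", "Drama", "Drama", "Comedy"],
    "release_year": [2014, 2014, 2014, 2015, 2015, 2015],
})

# Rows are years, columns are genres, values are the number of films.
counts = pd.crosstab(df["release_year"], df["genres"])

# Genre with the highest count in each year
# (on a tie, the first genre in column order wins).
top_genre = counts.idxmax(axis=1)
print(top_genre)
```

Calling `counts.plot()` on the resulting table draws one line per genre over the years, which matches the lineplot idea from the last comment.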

1 Answer

You could do something like this:

Plotting histogram using seaborn for a dataframe

Personally, I prefer seaborn for this kind of plot because it's easier, but you can use matplotlib too.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# sample data
samples = 300
ids = range(samples)
# the upper bound of randint is exclusive, so 0..4 covers all five genres
gind = np.random.randint(0, 5, samples)
years = np.random.randint(1990, 2000, samples)

# create sample dataframe
gkeys = {1: 'Drama', 2: 'Comedy', 3: 'Action', 4: 'Adventure', 0: 'Thriller'}
df = pd.DataFrame(zip(ids, gind, years),
                  columns=['ID', 'Genre', 'Year'])
df['Genre'] = df['Genre'].map(gkeys)

# count the rows per (year, genre) pair; the 'ID' column then holds the count
res = df.groupby(['Year', 'Genre']).count()
res = res.reset_index()

# optional: keep only the most frequent genre per year
# res_ind = res.groupby('Year')['ID'].idxmax()
# res = res.loc[res_ind]

# viz
sns.set(style="white")
g = sns.catplot(x='Year',
                y='ID',
                hue='Genre',
                data=res,
                kind='bar',
                errorbar=None,  # seaborn >= 0.12; use ci=None on older versions
                )
g.set_axis_labels("Year", "Count")
plt.show()

If that is too many bins for one plot, just split it up.
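
Splitting the plot up could look roughly like the sketch below, which draws one bar chart per decade. The data is randomly generated, and the `Decade` and `Count` columns are helper names introduced only for this illustration; splitting by decade is one assumed way to cut the years, not the only one.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Random sample data spanning three decades (made up for illustration).
rng = np.random.default_rng(0)
samples = 600
genres = ['Drama', 'Comedy', 'Action', 'Adventure', 'Thriller']
df = pd.DataFrame({
    'Genre': rng.choice(genres, samples),
    'Year': rng.integers(1980, 2010, samples),  # 1980..2009
})

# Number of films per (year, genre) pair.
res = df.groupby(['Year', 'Genre']).size().reset_index(name='Count')

# Helper column: 1987 -> 1980, 1994 -> 1990, ...
res['Decade'] = (res['Year'] // 10) * 10

# One subplot per decade, so each chart holds at most ten years.
decades = sorted(res['Decade'].unique())
fig, axes = plt.subplots(len(decades), 1,
                         figsize=(10, 3 * len(decades)), squeeze=False)
for ax, decade in zip(axes.ravel(), decades):
    subset = res[res['Decade'] == decade]
    sns.barplot(data=subset, x='Year', y='Count',
                hue='Genre', hue_order=genres, ax=ax)
    ax.set_title(f"{decade}s")
fig.tight_layout()
plt.show()
```

Passing the same `hue_order` to every subplot keeps the genre colours consistent across the decades.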

3 Comments

I think it's better to use the value counts for the years on the y-axis and to represent each bin on the x-axis by the genre with the highest count for that year. Vote count should not be used or required for this comparison.
I've edited my post. If you uncomment the section with the maxima, you will see only one bar per year. The appearance of the bars is not perfect yet; maybe I'll change this tomorrow. Instead of a barplot, you could create a heatmap.
Thanks for updating the code. I am new to the data analysis field and want to be good with the basics before I move on to the advanced level. I think a bar chart would be sufficient for now.
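
The heatmap mentioned in the comments above could be sketched like this. It is only an illustration on randomly generated data, with genres as rows and years as columns; the colour map and figure size are arbitrary choices.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Random sample data (made up for illustration).
rng = np.random.default_rng(42)
samples = 300
genres = ['Drama', 'Comedy', 'Action', 'Adventure', 'Thriller']
df = pd.DataFrame({
    'Genre': rng.choice(genres, samples),
    'Year': rng.integers(1990, 2000, samples),  # 1990..1999
})

# Rows are genres, columns are years, values are film counts.
counts = pd.crosstab(df['Genre'], df['Year'])

# Each cell is coloured by its count; annot=True prints the number in the cell.
fig, ax = plt.subplots(figsize=(10, 4))
sns.heatmap(counts, annot=True, fmt='d', cmap='viridis', ax=ax)
ax.set_xlabel("Year")
ax.set_ylabel("Genre")
fig.tight_layout()
plt.show()
```

With many years and genres, a heatmap stays readable where grouped bars get crowded, because every (genre, year) pair takes up exactly one cell.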