0

I'm having some issues plotting a second column from a pandas dataframe onto a twinx y-axis. I think it might be because the second, problematic column contains NaN values. The NaN values are there because data was only available every 10th year, whereas for the first column data was available every year. They were generated using np.nan; I included that code at the end for clarity.

The idea here is to plot both series on the same x-axis to show how they trend over time.

Here's my code and dataframe:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

list1 = ['1297606', '1300760', '1303980', '1268987', '1333521', '1328570', 
         '1328112', '1353671', '1371285', '1396658', '1429247', '1388937', 
         '1359145', '1330414', '1267415', '1210883', '1221585', '1186039', 
         '884273', '861789', '857475', '853485', '854122', '848163', '839226', 
         '820151', '852385', '827609', '825564', '789217', '765651']

list1a = [1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 
          1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 
          2004, 2005, 2006, 2007, 2008, 2009, 2010]

list3b = [121800016.0, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 
          145279588.0, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 
          160515434.5, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 
          168140487.0]

d = {'Year': list1a,'Abortions per Year': list1, 
     'Affiliation with Religious Institutions': list3b}
newdf = pd.DataFrame(data=d)

newdf.set_index('Year',inplace=True)

fig, ax1 = plt.subplots(figsize=(20,5))

y2min = min(newdf['Affiliation with Religious Institutions'])
y2max = max(newdf['Affiliation with Religious Institutions'])
ax1.plot(newdf['Abortions per Year'])
#ax1.set_xticks(newdf.index)
ax1b = ax1.twinx()
ax1b.set_ylim(y2min*0.8,y2max*1.2)
ax1b.plot(newdf['Affiliation with Religious Institutions'])
plt.show()

I end up with a chart that doesn't show the second plot. (When I changed the second column to have numeric values for every year, it was plotted.) With the NaN values in place, the second series is simply ignored.

Grateful for any advice.

*How the np.nan values were generated for the second column: I looped through the index column and, for every year without data, appended np.nan to the list, which was then made a column.

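list3b = []
j = 0  # position in list2, which holds the values for the years that have data
for i in range(len(list1a)):
    if list1a[i] in list3a:  # list3a: the years for which data exists
        list3b.append(list2[j])
        j += 1
    else:
        list3b.append(np.nan)  # no data for this year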
1
  • @James thanks for the edit. I pasted the list with nan (not np.nan) as it was printed. (Commented May 23, 2018 at 12:36)

4 Answers

3

Two things. First, you need to convert the Abortions per Year column to a numeric type before plotting, at least for the data you provided, which is in str format. Second, you can plot Affiliation with Religious Institutions as a line by dropping the NaN values before plotting.

ax1.plot(newdf['Abortions per Year'].astype(int))

...

ax1b.plot(newdf['Affiliation with Religious Institutions'].dropna())
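
Put together, the two changes might look like the following. This is only a minimal sketch with shortened stand-in data: the column names `yearly` and `decade` and all values are hypothetical, chosen to mirror the question's layout (a str column with a value every year, a float column with a value once per decade).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line to use plt.show() interactively
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in data, not the asker's real numbers
years = list(range(1980, 2001))
counts = [str(1000 + 10 * i) for i in range(len(years))]  # strings, like the original column
sparse = [np.nan] * len(years)
sparse[0], sparse[10], sparse[20] = 1.2e8, 1.45e8, 1.6e8  # one known value per decade

df = pd.DataFrame({"Year": years, "yearly": counts, "decade": sparse}).set_index("Year")

yearly = df["yearly"].astype(int)  # str -> int so matplotlib treats it as numbers
decade = df["decade"].dropna()     # keep only the years that have a value

fig, ax1 = plt.subplots(figsize=(10, 4))
ax1.plot(yearly.index, yearly.values, color="tab:blue")

ax2 = ax1.twinx()  # second y-axis sharing the same x-axis
ax2.plot(decade.index, decade.values, color="tab:red", marker="o")
# fig.savefig("twinx_dropna.png")  # or plt.show() with an interactive backend
```

After `dropna()`, the remaining points are adjacent in the series, so matplotlib connects them with straight segments.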

6 Comments

One of the principles of numpy and thus also pandas is: int for indexing, float for data. Thus your first line should be: ax1.plot(newdf['Abortions per Year'].astype(float))
Integer values for data are completely acceptable. Floating point operations are computationally more expensive, so keeping data that is an integer as an integer is a good idea.
This is true as long as you have small values. But you never know which calculations will be performed. Considering the numbers used in this example, with a maximum of about 1.68e8, integer operations can overflow. Just try np.array(newdf.max().astype(int))**2 and np.array(newdf.max().astype(float))**2.
This is the reason for the principle of: int for indexing, float for data. No one needs to follow it; it is not mandatory. But it is highly recommended. It is the same as with the PEP style guide and the Zen of Python: not mandatory, but there are really good reasons to follow the advice.
Where have you run across this for pandas or numpy?
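
Whether the squaring mentioned in the comments above actually overflows depends on the integer width in use. The sketch below makes the width explicit rather than relying on the platform default; the variable names are illustrative only, and `value` is the largest number from the question's data.

```python
import numpy as np

value = 168140487  # largest value in the question's second column

as_int32 = np.array([value], dtype=np.int32) ** 2    # wraps around silently
as_int64 = np.array([value], dtype=np.int64) ** 2    # wide enough for this product
as_float = np.array([value], dtype=np.float64) ** 2  # no wrap-around, but rounded

exact = value ** 2  # Python ints have arbitrary precision

print(int(as_int32[0]) == exact)  # False: the result does not fit in 32 bits
print(int(as_int64[0]) == exact)  # True
print(as_float[0])                # close to exact, subject to float64 rounding
```

So the concern applies when the integer dtype is 32-bit; with 64-bit integers this particular calculation is exact, while float64 avoids wrap-around at the cost of rounding for very large values.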
3

You can use pandas DataFrame methods for most of the things that you are doing. These two lines will solve all of your problems:

newdf = newdf.astype(float)
newdf = newdf.interpolate(method='linear')

So your code for plotting will look like this:

fig, ax1 = plt.subplots(figsize=(20,5))

newdf = newdf.astype(float)
newdf = newdf.interpolate(method='linear')
y2min = newdf['Affiliation with Religious Institutions'].min()
y2max = newdf['Affiliation with Religious Institutions'].max()
newdf['Abortions per Year'].plot.line(ax=ax1)
#ax1.set_xticks(newdf.index)
ax1b = ax1.twinx()
ax1b.set_ylim(y2min*0.8,y2max*1.2)
newdf['Affiliation with Religious Institutions'].plot.line(ax=ax1b)
plt.show()

Using the pandas methods for plotting a DataFrame is just a recommendation. You can also keep your matplotlib code, since pandas uses matplotlib as its plotting backend.

The two lines that I added do the following:
Your column Abortions per Year is of dtype object. You need to convert this to a numeric type with:

newdf = newdf.astype(float)

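In fact, the second column is not ignored. Its known values are isolated points separated by NaN, and a line segment is only drawn between two consecutive non-NaN values, so nothing visible appears. You can add a marker to the second plot to show those points. If you want to show a line for the second plot, you need to interpolate the values with: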

newdf = newdf.interpolate(method='linear')

Markers can be removed if interpolation is done.
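
The difference between the two options can be sketched on a tiny series. The data below is hypothetical and only meant to show what each option draws.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line to use plt.show() interactively
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical sparse series: known values only at both ends
s = pd.Series([100.0, np.nan, np.nan, np.nan, 200.0],
              index=[1980, 1981, 1982, 1983, 1984])

fig, (ax_points, ax_line) = plt.subplots(1, 2, figsize=(10, 4))

# Option 1: keep the NaN values and show the known values as markers only
ax_points.plot(s.index, s.values, marker="o", linestyle="none")

# Option 2: fill the gaps linearly, then draw an ordinary line
filled = s.interpolate(method="linear")
ax_line.plot(filled.index, filled.values)

print(filled.tolist())  # [100.0, 125.0, 150.0, 175.0, 200.0]
```

Note that interpolation puts estimated values on the chart for years that have no data, which may or may not be acceptable depending on what the chart is meant to show.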

4 Comments

Thanks @Scotty1-, what I am looking for is two lines. Which of the newdf conversions should I use?
They do two different things. newdf = newdf.astype(float) is needed to convert the data to float so that it is plotted correctly. Whether you want to use newdf = newdf.interpolate(method='linear') depends on whether you want markers only at the spots where the Affiliation is known or an interpolated line.
I updated my post so it has both solutions included.
Thanks very much @Scotty1- , that was super-helpful
2

I understand now. To achieve that with your existing code, you simply need to use pandas forward fill.

Right after

newdf.set_index('Year',inplace=True)

Just put

newdf.fillna(method='ffill', inplace=True)
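
As a small illustration of what forward fill does, here is a sketch on hypothetical data. It uses `DataFrame.ffill()`, which does the same fill; recent pandas versions deprecate the `method` argument of `fillna` in favor of it.

```python
import numpy as np
import pandas as pd

# Hypothetical data: known values in 1980 and 1983 only
df = pd.DataFrame(
    {"decade": [100.0, np.nan, np.nan, 200.0, np.nan]},
    index=pd.Index([1980, 1981, 1982, 1983, 1984], name="Year"),
)

filled = df.ffill()  # each NaN takes the last known value above it
print(filled["decade"].tolist())  # [100.0, 100.0, 100.0, 200.0, 200.0]
```

Because each gap repeats the previous value, the plotted line is a stair step rather than a slope between the known points.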

2 Comments

Dear @GeorgeLPerkins, the solution is accurate but for totally aesthetic purposes I like the way the line chart has a gradient between the data points.
Ah, I see. The gradient does look more pleasing than the stair step.
1

A basic thing going wrong here is you are plotting a point as a line.

list3b = [121800016.0, nan, nan, ...] goes from one point to nothing.

If you change the second nan to a value, e.g. list3b = [121800016.0, 121800016.0, nan, ...], then you will see a result.

Maybe you should plot those values as bars or scatter points.
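
A minimal sketch of that suggestion, assuming hypothetical data and dropping the NaN values first so that only the known points are drawn:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line to use plt.show() interactively
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical sparse series with three known values
s = pd.Series([1.2e8, np.nan, np.nan, 1.45e8, np.nan, np.nan, 1.6e8],
              index=[1980, 1981, 1982, 1983, 1984, 1985, 1986])
known = s.dropna()

fig, (ax_scatter, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))
ax_scatter.scatter(known.index, known.values)  # one dot per known year
ax_bar.bar(known.index, known.values)          # one bar per known year
```

Either form makes it clear that values exist only for some years, without implying anything about the years in between.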

2 Comments

what I'd like to get here is lines between all the values that exist. So for the second column, there would be lines connecting each point at every 10th year, and for the first column, there would be lines connecting the points at every single year.
@ZakS: I posted a solution which does exactly what you want.
