plotting a pandas dataframe column which contains NaN values

Question

I'm having some issues plotting a second column from a pandas dataframe onto a twinx y-axis. I think it might be because the second, problematic column contains NaN values. The NaN values are there because data for that column was only available every 10th year, whereas the first column has data for every year. They were generated using np.nan; I've included how at the end for clarity.

The intention here is to plot both series on the same x-axis to show how they trend over time.

Here's my code and dataframe:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

list1 = ['1297606', '1300760', '1303980', '1268987', '1333521', '1328570', 
         '1328112', '1353671', '1371285', '1396658', '1429247', '1388937', 
         '1359145', '1330414', '1267415', '1210883', '1221585', '1186039', 
         '884273', '861789', '857475', '853485', '854122', '848163', '839226', 
         '820151', '852385', '827609', '825564', '789217', '765651']

list1a = [1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 
          1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 
          2004, 2005, 2006, 2007, 2008, 2009, 2010]

list3b = [121800016.0, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 
          145279588.0, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 
          160515434.5, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 
          168140487.0]

d = {'Year': list1a,'Abortions per Year': list1, 
     'Affiliation with Religious Institutions': list3b}
newdf = pd.DataFrame(data=d)

newdf.set_index('Year',inplace=True)

fig, ax1 = plt.subplots(figsize=(20,5))

y2min = min(newdf['Affiliation with Religious Institutions'])
y2max = max(newdf['Affiliation with Religious Institutions'])
ax1.plot(newdf['Abortions per Year'])
#ax1.set_xticks(newdf.index)
ax1b = ax1.twinx()
ax1b.set_ylim(y2min*0.8,y2max*1.2)
ax1b.plot(newdf['Affiliation with Religious Institutions'])
plt.show()

I end up with a chart which doesn't show the second plot. (When I changed the second column to have numeric values for every year, it was plotted.) Here's the second plot (with NaN values) being ignored:

Grateful for any advice.

*how the np.nan values were generated for the second column: I looped thru the index column and for every year without data, returned np.nan to the list, which was then made a column.

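# list3a (the years that have data) and list2 (the matching values)
# are the asker's own lists and are not shown in the question.
list3b = []
j = 0
for i in range(len(list1a)):
    if list1a[i] in list3a:
        list3b.append(list2[j])  # a year with data: take the next value
        j += 1
    else:
        list3b.append(np.nan)    # a year without data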
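As an aside, the same NaN-padded column can probably be built without an explicit loop by reindexing a sparse Series onto the full range of years. This is only a sketch: `decade_years` and `decade_values` are hypothetical stand-ins for the `list3a` and `list2` lists that the question does not show.

```python
import pandas as pd

all_years = list(range(1980, 2011))       # every year, as in list1a
decade_years = [1980, 1990, 2000, 2010]   # hypothetical stand-in for list3a
decade_values = [121800016.0, 145279588.0,
                 160515434.5, 168140487.0]  # hypothetical stand-in for list2

# Build a Series holding only the years that have data...
sparse = pd.Series(decade_values, index=decade_years)

# ...then reindex onto every year; years without data become NaN.
affiliation = sparse.reindex(all_years)

nan_count = int(affiliation.isna().sum())
```

`affiliation.tolist()` would then give a list of the same shape as `list3b`.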

@James thanks for the edit; I pasted the list with nan (not np.nan) as it was printed
– ZakS, Commented May 23, 2018 at 12:36

James · Accepted Answer · 2018-05-23 12:40:47Z

3

Two things. You need to convert the Abortions per Year column to a numeric type for plotting, at least for the data you provided which is in str format; second, you can plot Affiliation with Religious Institutions as a line by dropping the nan values before plotting.

ax1.plot(newdf['Abortions per Year'].astype(int))

...

ax1b.plot(newdf['Affiliation with Religious Institutions'].dropna())
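Put together, the two changes above might look like the following sketch. It uses a shortened copy of the question's data (1980 to 1990 only), and the `Agg` backend line is an assumption so that it runs without a display; drop it and call `plt.show()` in an interactive session.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); remove for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

years = list(range(1980, 1991))
# Strings, exactly as in the question's list1 (first 11 entries).
abortions = ['1297606', '1300760', '1303980', '1268987', '1333521', '1328570',
             '1328112', '1353671', '1371285', '1396658', '1429247']
# One real value per decade, NaN in between.
affiliation = [121800016.0] + [np.nan] * 9 + [145279588.0]

df = pd.DataFrame({'Abortions per Year': abortions,
                   'Affiliation with Religious Institutions': affiliation},
                  index=pd.Index(years, name='Year'))

fig, ax1 = plt.subplots(figsize=(20, 5))
# Change 1: convert the string column to numbers before plotting.
ax1.plot(df['Abortions per Year'].astype(int))

ax1b = ax1.twinx()
# Change 2: drop the NaN rows so the remaining points are adjacent
# and matplotlib draws a line segment between them.
sparse = df['Affiliation with Religious Institutions'].dropna()
(line,) = ax1b.plot(sparse)
# plt.show()
```

Note that `dropna()` does not invent any values; the line simply connects the years that actually have data.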

answered May 23, 2018 at 12:40

James


6 Comments

JE_Muc Over a year ago

One of the principles of numpy and thus also pandas is: int for indexing, float for data. Thus your first line should be: ax1.plot(newdf['Abortions per Year'].astype(float))

James Over a year ago

Integer values for data are completely acceptable. Floating point operations are computationally more expensive, so keeping data that is an integer as an integer is a good idea.

JE_Muc Over a year ago

This is true as long as you have small values. But you never know which calculations will be performed. Considering the numbers used in this example, maximum about 1.68e8, int operations can be critical. Just try np.array(newdf.max().astype(int))**2 and np.array(newdf.max().astype(float))**2.

JE_Muc Over a year ago

This is the reason for the principle of: int for indexing, float for data. No one needs to follow it; it is not mandatory. But it is highly recommended. It is the same as with the PEP style guide and the Zen of Python: not mandatory, but there are really good reasons to follow the advice.

James Over a year ago

Where have you run across this for pandas or numpy?
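The overflow concern raised in the comments above can be checked directly. The sketch below is not from either commenter; it pins the dtypes explicitly, because whether squaring a value around 1.68e8 overflows depends on the integer width in use (an assumption that varies by platform and NumPy version).

```python
import numpy as np

v = 168140487  # largest affiliation value in the question, as an integer

a32 = np.array([v], dtype=np.int32)
a64 = np.array([v], dtype=np.int64)
af = np.array([v], dtype=np.float64)

sq32 = a32 ** 2   # v*v does not fit in 32 bits, so this wraps around silently
sq64 = a64 ** 2   # v*v is about 2.8e16, which fits comfortably in 64 bits
sqf = af ** 2     # float64 keeps the magnitude, at limited precision

exact = v * v     # Python ints are arbitrary precision
print(sq32[0], sq64[0], sqf[0], exact)
```

So with 64-bit integers the squared value is still exact here, while a 32-bit integer column would give a wrong answer without any error being raised.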


JE_Muc · Accepted Answer · 2018-05-23 12:58:35Z

3

You can use pandas DataFrame methods for most of the things that you are doing. These two lines will solve all of your problems:

newdf = newdf.astype(float)
newdf = newdf.interpolate(method='linear')

So your code for plotting will look like this:

fig, ax1 = plt.subplots(figsize=(20,5))

newdf = newdf.astype(float)
newdf = newdf.interpolate(method='linear')
y2min = newdf['Affiliation with Religious Institutions'].min()
y2max = newdf['Affiliation with Religious Institutions'].max()
newdf['Abortions per Year'].plot.line(ax=ax1)
#ax1.set_xticks(newdf.index)
ax1b = ax1.twinx()
ax1b.set_ylim(y2min*0.8,y2max*1.2)
newdf['Affiliation with Religious Institutions'].plot.line(ax=ax1b)
plt.show()

Using the pandas methods for plotting a DataFrame is just a recommendation. You can also keep your matplotlib code, since pandas uses matplotlib as its plotting backend.

The two lines that I added do the following:
Your column Abortions per Year is of dtype object. You need to convert this to a numeric type with:

newdf = newdf.astype(float)

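In fact the second column is not being ignored. Each real value has NaN on both sides, so it is an isolated single point and no line segment can be drawn to or from it. You can make these points visible by adding a marker to the second plot. If you want to show a line for the second plot, you need to interpolate the values with: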

newdf = newdf.interpolate(method='linear')

Markers can be removed if interpolation is done.
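The two options described above (markers on the raw data versus an interpolated line) could be compared side by side roughly like this. It is a sketch on a shortened series covering 1980 to 1990; the `Agg` backend and the two-panel layout are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); remove for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

years = list(range(1980, 1991))
s = pd.Series([121800016.0] + [np.nan] * 9 + [145279588.0], index=years)

fig, (ax_markers, ax_line) = plt.subplots(1, 2, figsize=(12, 4))

# Option 1: keep the NaNs and add a marker so the isolated points show up.
(marked,) = ax_markers.plot(s, marker='o')

# Option 2: fill the gaps linearly so one continuous line is drawn.
filled = s.interpolate(method='linear')
ax_line.plot(filled)
# plt.show()
```

Because the years are evenly spaced, the interpolated value for 1985 lands halfway between the 1980 and 1990 values.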

edited May 23, 2018 at 12:58

answered May 23, 2018 at 12:25

JE_Muc


4 Comments

ZakS Over a year ago

Thanks @Scotty1-, what I am looking for is two lines. Which of the newdf conversions should I use?

JE_Muc Over a year ago

The two do different things. newdf = newdf.astype(float) is needed to convert the data to float so that it plots correctly. Whether you use newdf = newdf.interpolate(method='linear') depends on whether you want markers only at the spots where the Affiliation is known or want to plot an interpolated line.

JE_Muc Over a year ago

I updated my post so it has both solutions included.

ZakS Over a year ago

Thanks very much @Scotty1- , that was super-helpful

GeorgeLPerkins · Accepted Answer · 2018-05-23 13:01:10Z

2

I understand now. To achieve that with your existing code, you simply need to use pandas forward fill.

Right after

newdf.set_index('Year',inplace=True)

Just put

newdf.ffill(inplace=True)
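To see what forward fill does to the values, and how it differs from linear interpolation, here is a small sketch on a shortened series (1980 to 1990 only, values taken from the question):

```python
import numpy as np
import pandas as pd

s = pd.Series([121800016.0] + [np.nan] * 9 + [145279588.0],
              index=range(1980, 1991))

# Forward fill repeats the last known value until the next real one,
# which plots as a flat step followed by a jump.
stepped = s.ffill()

# Linear interpolation ramps between the known values instead.
sloped = s.interpolate(method='linear')

print(stepped.loc[1989], sloped.loc[1989])
```

Forward fill holds the 1980 value through 1989, so the plotted line is flat and then jumps in 1990.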

answered May 23, 2018 at 13:01

GeorgeLPerkins


2 Comments

ZakS Over a year ago

Dear @GeorgeLPerkins, the solution is accurate but for totally aesthetic purposes I like the way the line chart has a gradient between the data points.

GeorgeLPerkins Over a year ago

Ah, I see. The gradient does look more pleasing than the stair step.

GeorgeLPerkins · Accepted Answer · 2018-05-23 12:37:56Z

1

A basic thing going wrong here is you are plotting a point as a line.

list3b = [121800016.0, nan, nan....... Goes from one point to nothing.

If you change the second nan to a value: list3b = [121800016.0, 121800016.0, nan, ..... then you will see a result.

Maybe you should plot those values as bars or scatter points.
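The scatter suggestion might look like the sketch below, which draws only the years that have a value. It uses a shortened series, and the `Agg` backend is an assumption so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); remove for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

s = pd.Series([121800016.0] + [np.nan] * 9 + [145279588.0],
              index=range(1980, 1991))

points = s.dropna()  # keep only the years with real data

fig, ax = plt.subplots()
sc = ax.scatter(points.index, points.values)
# plt.show()
```

A bar chart would follow the same pattern with `ax.bar(points.index, points.values)`.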

answered May 23, 2018 at 12:37

GeorgeLPerkins


2 Comments

ZakS Over a year ago

what I'd like to get here is lines between all the values that exist. So for the second column, there would be lines connecting each point at every 10th year, and for the first column, there would be lines connecting the points at every single year.

JE_Muc Over a year ago

@ZakS: I posted a solution which does exactly what you want.
