Pandas: interpolate missing rows and plot multiple series in dataframe

Question

I'm looking for pointers to the appropriate docs for accomplishing the analysis task described below with pandas in pylab. I've previously written Python + matplotlib functions that accomplish much of this, but the resulting code is slow and cumbersome to maintain. It seems that pandas has the capabilities I need, but I'm getting bogged down trying to find the right approach and functions.

In [1]: import pandas as pd

In [6]: df = pd.read_csv("tinyexample.csv", parse_dates=2)

In [7]: df
Out[7]: 
   I                  t       A      B        C     D        E
0  1  08/06/13 02:34 PM  109.40  105.50  124.30  1.00  1930.95
1  1  08/06/13 02:35 PM  110.61  106.21  124.30  0.90  1964.89
2  1  08/06/13 02:37 PM  114.35  108.84  124.30  0.98  2654.33
3  1  08/06/13 02:38 PM  115.38  109.81  124.30  1.01  2780.63
4  1  08/06/13 02:40 PM  116.08  110.94  124.30  0.99  2521.28
5  4  08/06/13 02:34 PM  105.03  100.96  127.43  1.12  2254.51
6  4  08/06/13 02:35 PM  106.73  101.72  127.43  1.08  2661.76
7  4  08/06/13 02:38 PM  111.21  105.17  127.38  1.06  3163.07
8  4  08/06/13 02:40 PM  111.69  106.28  127.38  1.09  2898.73

The above is a tiny slice of minute-by-minute readings from a network of radio-connected data loggers. The sample shows output from 2 loggers over a 10 minute period. The actual data files have output from dozens of loggers over multiple days.

Column 'I' is the logger id, 't' is a timestamp, 'A-C' are temperatures, 'D' is a flow rate, and 'E' is an energy rate computed from A, B, and D.

Because of poor radio connectivity, every logger has missing readings at random times.

Specifically, I want to do something like the following:

for i in I:
    ## Insert rows for all missing timestamps with interpolated values for A through E
    ## Update a new column 'F' with a cumulative sum of 'E' (actually E/60)

Then I want to be able to define a plotting function that allows me to output vertically-aligned strip-chart plots similar to those shown in the docs at http://pandas.pydata.org/pandas-docs/dev/visualization.html. I've tried

df.plot(subplots=True, sharex=True)

which almost does what I need, except that

It plots by index number rather than by date.
It doesn't create individual plot lines for each logger id.

[image: plot results]

Finally, I'd want to be able to choose a subset of the logger id's and data columns to plot, e.g.

def myplot(df, ilist, clist):
    """
    ilist is of the form [ n, m, p, ...] where n, m, and p are logger id's in column 'I'
    clist is a list of column labels.

    Produces a stack of strip-chart plots, one per column, each containing a plot line for every id.
    """

SOLUTION (using Dan Allan's accepted answer -- thanks, Dan)

import pandas as pd
import matplotlib.pyplot as plt 

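def myinterpolator(grp, cols=['I', 'A', 'B', 'C', 'D', 'E']):
    # Regular 1-minute grid from this logger's first reading to its last
    index = pd.date_range(freq='1min',
            start=grp.first_valid_index(),
            end=grp.last_valid_index())
    # Add the missing timestamps as NaN rows; Index.union avoids passing an unordered set
    g1 = grp.reindex(grp.index.union(index)).sort_index()
    for col in cols:
        # .loc replaces .ix, which was removed in pandas 1.0
        g1[col] = g1[col].interpolate(method='time').loc[index]
    g1['F'] = g1['E'].cumsum()
    return g1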


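def myplot(df, ilist, clist):
    df1 = df[df['I'].isin(ilist)][clist + ['I']]
    # Share the x axis when the subplots are created; squeeze=False keeps ax 2-D
    # so that a single-column clist also works
    fig, ax = plt.subplots(len(clist), sharex=True, squeeze=False)
    for I, grp in df1.groupby('I'):
        for j, col in enumerate(clist):
            grp[col].plot(ax=ax[j, 0])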


df = pd.read_csv("tinyexample.csv", parse_dates=True, index_col=1)

df_interpolated = pd.concat([myinterpolator(grp) for I, grp in df.groupby('I')])
myplot(df_interpolated, ilist=[1,4], clist=['F', 'A', 'C'])
plt.tight_layout()

How do you want your values interpolated? Simple methods, like filling forward the last valid value, are supported. More advanced methods can be done, but there aren't convenience wrappers in pandas: there's an open issue with a bit of discussion about interpolation.
– TomAugspurger, Commented Aug 21, 2013 at 16:49
As linked in my answer below, see this for one approach to resampling. Perhaps the issue Tom linked above will lead to a cleaner solution.
– Dan Allan, Commented Aug 21, 2013 at 17:32
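
The two options mentioned in the comments above (filling forward versus interpolating) can be compared on a toy series. This is only a sketch: the timestamps and values are hypothetical, and it assumes a pandas version that provides `Series.interpolate` with `method="time"`, which postdates the comments.

```python
import pandas as pd

# Hypothetical readings; the 14:36 sample is missing.
idx = pd.to_datetime(["2013-08-06 14:34", "2013-08-06 14:35", "2013-08-06 14:37"])
s = pd.Series([109.40, 110.61, 114.35], index=idx)

# Put the series on a regular 1-minute grid; the missing minute becomes NaN.
grid = pd.date_range(idx[0], idx[-1], freq="1min")
regular = s.reindex(grid)

# Option 1: repeat the last valid value.
filled_forward = regular.ffill()

# Option 2: interpolate linearly in elapsed time between neighbouring readings.
filled_linear = regular.interpolate(method="time")

print(pd.DataFrame({"ffill": filled_forward, "time": filled_linear}))
```

For a slowly varying temperature either choice may be acceptable; for a quantity that will be summed, such as 'E', the time-based interpolation is probably the closer match to what the question asks for.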

Community · Accepted Answer · 2017-05-23 12:16:30Z

Two pieces of this are tricky: interpolation (see Tom's comment) and your desire to plot different sensors in the same subplot. The subplots=True keyword is not sufficient for this subtlety; you have to use a loop. This works.

import matplotlib.pyplot as plt

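def myplot(df, ilist, clist):
    # Keep only the requested loggers and columns, indexed by timestamp
    df1 = df[df['I'].isin(ilist)][clist + ['t', 'I']].set_index('t')
    # Share the x axis when the subplots are created; squeeze=False keeps ax 2-D
    # so that a single-column clist also works
    fig, ax = plt.subplots(len(clist), sharex=True, squeeze=False)
    for I, grp in df1.groupby('I'):
        for j, col in enumerate(clist):
            grp[col].plot(ax=ax[j, 0])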

Usage:

df['t'] = pd.to_datetime(df['t']) # Make sure pandas treats t as times.
myplot(df, [1, 4], ['A', 'B', 'C'])
plt.tight_layout() # cleans up the spacing of the plots

You may not actually need interpolation. The above executes even if some data is missing, and the plot lines visually interpolate the data linearly. But if you want actual interpolation -- say for additional analysis -- see this answer.
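
If actual interpolation is wanted, one possible approach for a single logger is to resample onto a 1-minute grid and then interpolate. This is a sketch, not necessarily the method in the linked answer: `interpolate_minutes` is a hypothetical helper, the sample values are taken from the question, and it assumes the frame is indexed by parsed timestamps that fall exactly on minute boundaries.

```python
import pandas as pd

def interpolate_minutes(grp, cols=("A", "E")):
    """Hypothetical helper: put one logger's rows on a 1-minute grid."""
    # resample(...).mean() yields NaN for every minute that has no reading
    regular = grp[list(cols)].resample("1min").mean()
    # Fill those gaps linearly in elapsed time
    return regular.interpolate(method="time")

# Readings for one logger with the 14:36 row missing
idx = pd.to_datetime(["2013-08-06 14:34", "2013-08-06 14:35",
                      "2013-08-06 14:37", "2013-08-06 14:38"])
one_logger = pd.DataFrame({"A": [109.40, 110.61, 114.35, 115.38],
                           "E": [1930.95, 1964.89, 2654.33, 2780.63]},
                          index=idx)

filled = interpolate_minutes(one_logger)
print(filled)
```

The result has one row per minute, so later steps such as a cumulative sum see evenly spaced samples.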

edited May 23, 2017 at 12:16

answered Aug 21, 2013 at 17:27

Dan Allan

2 Comments

Mike Ellis Over a year ago

The plotting works nicely. Thanks! I do need the interpolation to get as much accuracy as possible when summing the energy production, 'E', unless pandas has a built-in integrator function that's mathematically equivalent. I looked at your answer for interpolation. Do I need to break the df into separate time series for each logger before interpolating and then stitch them back together?

Dan Allan Over a year ago

Yes. For example, df_interpolated = pd.concat([f(grp) for I, grp in df.groupby('I')]) where f is some function that interpolates your data.
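
The split-apply-concat pattern from the comment above can be sketched with a per-logger function that integrates 'E' directly. This is an illustration rather than a built-in pandas integrator: `add_cumulative_energy` is a hypothetical name, the data are made up, and the code assumes 'E' is a rate per hour (the question's "E/60" per minute suggests this). It applies the trapezoidal rule to the irregular timestamps, which equals the exact integral of the linearly interpolated signal, so no rows need to be inserted for the sum itself.

```python
import pandas as pd

def add_cumulative_energy(grp):
    """Hypothetical f(grp): add 'F', the running trapezoidal integral of 'E'."""
    grp = grp.sort_index().copy()
    # Elapsed hours between consecutive readings (NaN for the first row)
    hours = grp.index.to_series().diff().dt.total_seconds() / 3600.0
    # Trapezoid area: mean of neighbouring E values times the elapsed time
    area = (grp["E"] + grp["E"].shift()) / 2.0 * hours
    grp["F"] = area.fillna(0.0).cumsum()
    return grp

# Made-up data for two loggers, each missing the 14:36 reading
idx = pd.to_datetime(["2013-08-06 14:34", "2013-08-06 14:35", "2013-08-06 14:37"])
df = pd.concat([
    pd.DataFrame({"I": 1, "E": [60.0, 120.0, 240.0]}, index=idx),
    pd.DataFrame({"I": 4, "E": [600.0, 600.0, 600.0]}, index=idx),
])

# Process each logger separately, then stitch the pieces back together
result = pd.concat([add_cumulative_energy(grp) for _, grp in df.groupby("I")])
print(result)
```

Summing interpolated per-minute values of E/60 is a rectangle-rule approximation of the same integral, so the two totals should be close but need not match exactly.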
