
I've got a DataFrame of ~40,000 rows that looks roughly like:

             Unix Time                           UTC  Val. 1  Val. 2  Val. 3
1    1518544176.927486    2018-02-13 17:49:36.927486    5.00    0.25    2.00
2    1518544176.929897    2018-02-13 17:49:36.929897    4.50    1.00    3.00
3    1518544176.932310    2018-02-13 17:49:36.932310    4.00    0.75    1.75
...

Columns 0 and 2-4 are of type <class 'numpy.float64'>. Column 1 is of type <class 'pandas._libs.tslib.Timestamp'>. When any of the data columns is plotted against time, we see a waveform. However, there are occasional breaks in the acquisition. For instance, we might have:

               Unix Time                           UTC  Val. 1  Val. 2  Val. 3
576    1518544181.755085    2018-02-13 17:49:41.755085    0.10    0.01    0.93
577    1518544182.041129    2018-02-13 17:49:42.041129    0.11    0.02    0.95
...

As one can see, there's a ~0.3 s gap between readings 576 and 577. The problem is that when plotting the data, matplotlib connects the dots, even across spans where there's no data. This "problem" has been addressed in other questions on Stack Overflow and online at large, and though I'm not in love with... well, any of them, the best option seems to be to insert NaNs into the data gaps. Since matplotlib doesn't plot NaNs, it's a sneaky way to trick it into making your plot more realistic.
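(A minimal illustration of that behavior, with made-up values: matplotlib draws nothing at a NaN, so the line breaks there.)

import numpy as np
import matplotlib.pyplot as plt

# Toy data (hypothetical values): the NaN at x = 2 breaks the line, so
# matplotlib draws two disconnected segments instead of one continuous line.
x = [0, 1, 2, 3, 4]
y = [1.0, 2.0, np.nan, 2.0, 1.0]
plt.plot(x, y)
plt.show()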

To do this, I start by finding the time delta between the first two readings (this is safe) and using twice that value as my metric for "is there a gap?". I then iterate through the DataFrame, checking the gaps. Upon finding one, I create a temporary row with NaNs in the data columns and, in the time columns, time values right in the middle of the acquisition gap. I then build a new DataFrame out of the old one plus this row. This can be seen here:

import numpy as np
import pandas as pd

# "Is there a gap?" metric: twice the delta-t of the first two readings.
allowed_gap = 2.0 * (df['Unix Time'].iloc[1] - df['Unix Time'].iloc[0])

df2 = df.copy()
n_inserted = 0  # insertions so far; keeps positions in df2 aligned with df
for i, row in df.iterrows():
    # The following code checks the delta-t of all timestamp pairs.
    # We have i > 0 because the first entry has no predecessor.
    if i > 0:
        delta_t_unix = row['Unix Time'] - prev_timestamp_unix
        delta_t_utc = row['UTC'] - prev_timestamp_utc
        # If delta_t_unix > the allowed data gap, insert a row of NaNs
        # timestamped at the midpoint of the gap.
        if delta_t_unix > allowed_gap:
            time_unix = row['Unix Time'] - (delta_t_unix / 2.0)
            time_utc = row['UTC'] - (delta_t_utc / 2.0)
            new_row = pd.DataFrame({'Unix Time': time_unix, 'UTC': time_utc,
                                    'Val. 1': np.nan, 'Val. 2': np.nan,
                                    'Val. 3': np.nan}, index=[i])
            pos = i + n_inserted
            df2 = pd.concat([df2.iloc[:pos], new_row,
                             df2.iloc[pos:]]).reset_index(drop=True)
            n_inserted += 1
    # Set the previous timestamps for use at the beginning of the next pass.
    prev_timestamp_unix = row['Unix Time']
    prev_timestamp_utc = row['UTC']
# Restore the original column order.
df2 = df2[['Unix Time', 'UTC', 'Val. 1', 'Val. 2', 'Val. 3']]

This currently takes ~4.5 seconds, thanks to this question (it used to take ~6.5 s, when I was foolishly iterating through each column, building new lists, and then creating a new DataFrame out of those). However, that's still far slower than I'd expect or prefer. Does anyone have ideas on how to speed this up? I'm still quite new to Pandas and DataFrames, so I'm sure this could be better. Thanks!

EDIT: Worth mentioning that if I remove the datetime column, the runtime is cut roughly in half (though unfortunately I can't remove that column in practice).
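One sketch I haven't tested, assuming the UTC column is simply the Unix timestamp rendered as a datetime (as the sample rows suggest): skip 'UTC' entirely during the loop and rebuild it once at the end in a single vectorized call.

# Untested sketch: do the gap arithmetic on the float 'Unix Time' column
# only, then reconstruct 'UTC' from it after the loop.
df2['UTC'] = pd.to_datetime(df2['Unix Time'], unit='s')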

3 Comments
  • Actually, I am pretty sure pandas has a built-in function for what you want; I am checking this out and I'll post it for you Commented Feb 19, 2018 at 14:16
  • Didn't find anything that would produce exactly what you have built. I was actually mistaken in thinking that resampling could do it. Commented Feb 19, 2018 at 14:25
  • I've corrected some errors in my post, and it should work now. If you can share how long that code took to run, I would appreciate that. Commented Feb 19, 2018 at 14:44

2 Answers


You can resample to 2.4ms using something like:

df['utc_time'] = pd.to_datetime(df['utc_time'])
df.set_index(df['utc_time'])[['val1','val2','val3']].resample('2.4ms', loffset='1.2ms').mean().reset_index()
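If the smallest time step varies between datasets, one possible variation (an untested sketch, assuming utc_time has already been converted to datetimes as above) is to derive the rule from the data instead of hardcoding '2.4ms':

# Sketch: use the smallest observed time step as the resample rule, and
# shift each bin label by half a step to center it (as '1.2ms' does above).
step = df['utc_time'].diff().min()
df1 = (df.set_index('utc_time')[['val1', 'val2', 'val3']]
         .resample(step, loffset=step / 2).mean()
         .reset_index())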

8 Comments

Not a bad idea at all, but unfortunately I don't think resampling is what I'm looking for (I think, unless I misunderstand what it does). I'd like to be able to see every last bit of the data, including the gaps.
If you resample an irregularly sampled waveform onto the smallest time step, which appears to be 2.4ms in the data you've shown, you won't lose any data.
The smallest time step will change from dataset to dataset, and there's some jitter within each dataset as well. Granted, that's all easy to account for, but the basic point is that I actually need to visualize the gaps in the data, which this method removes (at least, when I tried what you posted, it removed the gaps).
No worries, I am trying to be helpful. Are you using an IDE? If you run those two lines, what displays in the console? You just get the input back? If you want to save the result, you will have to add df1= to the front of the second line.
... That said, it is much faster than what we did above (fractions of seconds at this point, but still). I'm going to leave @joaoavf's as the answer for now, because it perfectly retains the original data, but I will definitely be exploring this. Thank you for your help!

Some things that can speed this up:

  1. Use df.itertuples() instead of df.iterrows(); it will probably improve the execution time somewhat.

If you can post the improvements I would appreciate it; in my experience it made a huge difference in the cases I tested (a 10x improvement, with much simpler instructions inside the loop).

  2. Instead of using pd.concat to put rows together, build a list of tuples and generate a DataFrame from it only after the loop:

    list_of_values = []
    for i, unix_time, utc_time, val1, val2, val3 in df.itertuples():
        if i > 0:
            delta_t_unix = unix_time - prev_timestamp_unix
            delta_t_utc = utc_time - prev_timestamp_utc

            # On a gap, append a NaN row timestamped at the gap's midpoint.
            if delta_t_unix > allowed_gap:
                new_time_unix = unix_time - (delta_t_unix / 2.0)
                new_time_utc = utc_time - (delta_t_utc / 2.0)
                list_of_values.append((new_time_unix, new_time_utc,
                                       np.nan, np.nan, np.nan))

        # Set the previous timestamp for use in the beginning of the loop.
        prev_timestamp_unix = unix_time
        prev_timestamp_utc = utc_time

        list_of_values.append((unix_time, utc_time, val1, val2, val3))

    df2 = pd.DataFrame(list_of_values,
                       columns=['Unix Time', 'UTC', 'Val. 1', 'Val. 2', 'Val. 3'])

That will probably speed things up quite a lot.

7 Comments

I'd never heard of itertuples. Thank you! I'm running tests now.
@erekalper, it was missing this line list_of_values.append((new_time_unix, new_time_utc, val1, val2, val3)) to fill the old values as well on the new DataFrame
Yeah, I noticed :). Thanks for editing! (And technically the values in that new appended list should be unix_time and utc_time, but I know what you meant.)
@erekalper, I think using a queue or deque might speed up things compared to the list, but as the dataset is not huge, I think it will be a really slight improvement that might not be worth implementing.
Your method (1) brought it down to ~1 s, and (2) brought it down to ~0.75 s. This is unbelievably helpful, and I will definitely be using itertuples in the future!
