I'm collecting time series data, but for some time points there is no data to collect. For example, when collecting data across four time points, I might get a dataframe like this:

import pandas as pd

df_ = pd.DataFrame({'group': ['A']*3+['B']*3,
                    'time': [1,2,4,1,3,4],
                    'value': [100,105,111,200,234,222]})

Sometimes a data point is missing, so there is no row for that time point. I would like to group by the group column and forward fill from the previous value to create the missing row, so that the result would look like this:

df_missing_completed = pd.DataFrame({'group': ['A']*4+['B']*4,
                                     'time': [1,2,3,4,1,2,3,4],
                                     'value': [100, 101, 105,111,200, 202, 234,222]})

My idea was to create a new dataframe as a template containing all the groups and time points but no values, left-join it with the real data (which introduces NAs), and then do a ffill on the value column to fill in the missing data, like below:

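df_template = pd.DataFrame({'group': ['A']*4+['B']*4,
                            'time': [1,2,3,4,1,2,3,4]})
# the left join keeps every template row; missing data shows up as NaN
df_final = pd.merge(df_template, df_, on=['group', 'time'], how='left')
# .ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas
df_final['filled_values'] = df_final['value'].ffill()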

But this seems like a messy solution, and with the real data df_template will be more complex to create. Does anyone know a better one? Thanks!

  • You don't really do a ffill here, can you clarify the logic? Commented Dec 1, 2022 at 14:05
  • you may want to have a look at reindex with method='ffill' Commented Dec 1, 2022 at 14:05
  • Thanks @mozway. I edited my question to clarify how it would work with a template dataframe and ffill. But I don't find it to be a satisfying solution. Commented Dec 1, 2022 at 14:14
  • your df_missing_completed doesn't correspond to df_final, as mozway pointed out in the first comment Commented Dec 1, 2022 at 15:17

2 Answers

I would use:

(df_.pivot(index='time', columns='group', values='value')
    # reindex only if you want to add missing times for all groups
    .reindex(range(df_['time'].min(), df_['time'].max()+1))
    .ffill().unstack().reset_index(name='value')
)

Output:

  group  time  value
0     A     1  100.0
1     A     2  105.0
2     A     3  105.0
3     A     4  111.0
4     B     1  200.0
5     B     2  200.0
6     B     3  234.0
7     B     4  222.0
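
To illustrate the case where the same time point is missing for every group, here is a minimal sketch of the same pivot/reindex/ffill chain. The dataframe df_gap and the names full_range and filled are made up for this illustration: it is the question's data with time 3 removed from both groups. The reindex step is what adds the missing row back before the fill.

```python
import pandas as pd

# Hypothetical variant of the question's data: time 3 is absent for both groups.
df_gap = pd.DataFrame({'group': ['A']*3 + ['B']*3,
                       'time': [1, 2, 4, 1, 2, 4],
                       'value': [100, 105, 111, 200, 234, 222]})

# Every integer time point between the observed minimum and maximum.
full_range = range(df_gap['time'].min(), df_gap['time'].max() + 1)

filled = (df_gap.pivot(index='time', columns='group', values='value')
          .reindex(full_range)   # adds a row for time 3, all NaN at first
          .ffill()               # each column (one per group) is filled on its own
          .unstack()
          .reset_index(name='value'))
print(filled)
```

Because each group is a separate column after the pivot, the fill cannot carry a value from one group into another. The values come back as floats, since the intermediate frame has to hold NaN.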

4 Comments

this is fine for the given example but won't work in general, e.g. if the same time is missing for all groups, say for 'group': ['A']*3+['B']*3, 'time': [1,2,4,1,2,4]
@Stef it's not clear that OP wants to fill those, but if this is the case, you can reindex before the ffill
@mozway, thanks, it seems to work, however, with my large dataset, there may be cases where the same time is missing for all groups. Would it not work then?
@stevezissou if you want to add all numbers you need to add a reindex step, is this what you want? See the updated answer.

Instead of a template dataframe you could create a new index and then reindex with ffill:

new_idx = pd.MultiIndex.from_product([list('AB'), range(1,5)], names=['group', 'time'])
df_.set_index(['group', 'time']).reindex(new_idx, method='ffill').reset_index()

The result keeps the datatype of the value column:

  group  time  value
0     A     1    100
1     A     2    105
2     A     3    105
3     A     4    111
4     B     1    200
5     B     2    200
6     B     3    234
7     B     4    222
