Add random data between rows in a pandas dataframe

Question

I've got a pandas dataframe like this. This contains a timestamp, id, foo and bar. The timestamp data is around every 10 minutes.

timestamp            id  foo  bar
2019-04-14 00:00:10  1   0.10 0.05
2019-04-14 00:10:02  1   0.30 0.10
2019-04-14 00:00:00  2   0.10 0.05
2019-04-14 00:10:00  2   0.30 0.10

For each id, I'd like to create 5 additional rows with timestamp split equally between successive rows, and foo & bar values containing random values between the successive rows.

The start time should be the earliest timestamp for each id and the end time should be the latest timestamp for each id

So the output would be like this.

timestamp            id  foo  bar
2019-04-14 00:00:10  1   0.10 0.05
2019-04-14 00:02:10  1   0.14 0.06
2019-04-14 00:04:10  1   0.11 0.06
2019-04-14 00:06:10  1   0.29 0.07
2019-04-14 00:08:10  1   0.22 0.09
2019-04-14 00:10:02  1   0.30 0.10
2019-04-14 00:00:00  2   0.80 0.50
2019-04-14 00:02:00  2   0.45 0.48
2019-04-14 00:04:00  2   0.52 0.42
2019-04-14 00:06:00  2   0.74 0.48
2019-04-14 00:08:00  2   0.41 0.45
2019-04-14 00:10:00  2   0.40 0.40

I can reindex the timestamp column and create additional timestamp rows (eg. Pandas create new date rows and forward fill column values).

But I can't seem to wrap my head around how to compute the random values for foo and bar between the successive rows.

Appreciate if someone can point me in the right direction!

jezrael · Accepted Answer · 2019-04-14 13:02:01Z

The close, what you need is use date_range with DataFrame.reindex by first and last value of DatetimeIndex:

df['timestamp'] = pd.to_datetime(df['timestamp'])

df = (df.set_index('timestamp')
        .groupby('id')['foo','bar']
        .apply(lambda x: x.reindex(pd.date_range(x.index[0], x.index[-1], periods=6))))

Then create helper DataFrame with same size like original and DataFrame.fillna missing values:

df1 = pd.DataFrame(np.random.rand(*df.shape), index=df.index, columns=df.columns)
df = df.fillna(df1)
print (df)
                                 foo       bar
id                                            
1  2019-04-14 00:00:10.000  0.100000  0.050000
   2019-04-14 00:02:08.400  0.903435  0.755841
   2019-04-14 00:04:06.800  0.956002  0.253878
   2019-04-14 00:06:05.200  0.388454  0.257639
   2019-04-14 00:08:03.600  0.225535  0.195306
   2019-04-14 00:10:02.000  0.300000  0.100000
2  2019-04-14 00:00:00.000  0.100000  0.050000
   2019-04-14 00:02:00.000  0.180865  0.327581
   2019-04-14 00:04:00.000  0.417956  0.414400
   2019-04-14 00:06:00.000  0.012686  0.800948
   2019-04-14 00:08:00.000  0.716216  0.941396
   2019-04-14 00:10:00.000  0.300000  0.100000

Erfan · Accepted Answer · 2019-04-14 13:30:59Z

1

If the 'randomness' is not as crucial. We can use Series.interpolate which will keep the values between your min and max per group:

df_new = pd.concat([
    d.reindex(pd.date_range(d.timestamp.min(), d.timestamp.max(), periods=6))
    for _, d in df.groupby('id')
])
df_new['timestamp'] = df_new.index
df_new.reset_index(drop=True, inplace=True)

df_new = df_new[['timestamp']].merge(df, on='timestamp', how='left')
df_new['id'].fillna(method='ffill', inplace=True)

df_new[['foo', 'bar']] = df_new[['foo', 'bar']].apply(lambda x: x.interpolate())

Which gives the following output:

print(df_new)
                 timestamp   id   foo   bar
0  2019-04-14 00:00:10.000  1.0  0.10  0.05
1  2019-04-14 00:02:08.400  1.0  0.14  0.06
2  2019-04-14 00:04:06.800  1.0  0.18  0.07
3  2019-04-14 00:06:05.200  1.0  0.22  0.08
4  2019-04-14 00:08:03.600  1.0  0.26  0.09
5  2019-04-14 00:10:02.000  1.0  0.30  0.10
6  2019-04-14 00:00:00.000  2.0  0.10  0.05
7  2019-04-14 00:02:00.000  2.0  0.14  0.06
8  2019-04-14 00:04:00.000  2.0  0.18  0.07
9  2019-04-14 00:06:00.000  2.0  0.22  0.08
10 2019-04-14 00:08:00.000  2.0  0.26  0.09
11 2019-04-14 00:10:00.000  2.0  0.30  0.10

answered Apr 14, 2019 at 13:30

Erfan

43.3k10 gold badges75 silver badges86 bronze badges

1 Comment

Kvothe Over a year ago

Hey @erfan thanks for this! the randomness isn't crucial. However, the min and max values should be based on successive rows. Whereas your solution provides random data based on min and max of the whole group.

Collectives™ on Stack Overflow

Add random data between rows in a pandas dataframe

2 Answers 2

Comments

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related