Merging time series data by timestamp using numpy/pandas

Question

I have time series data from three completely different sensor sources as CSV files and want to combine them into one big CSV file. I've managed to read them into numpy using numpy's genfromtxt, but I'm not sure what to do from here.

Basically, what I have is something like this:

Table 1:

timestamp    val_a   val_b   val_c

Table 2:

timestamp    val_d   val_e   val_f   val_g

Table 3:

timestamp    val_h   val_i

All timestamps are UNIX millisecond timestamps as numpy.uint64.

And what I want is:

timestamp    val_a   val_b   val_c   val_d   val_e   val_f   val_g   val_h   val_i

...where all data is combined and ordered by timestamps. Each of the three tables is already ordered by timestamp. Since the data comes from different sources, there is no guarantee that a timestamp from table 1 will also be in table 2 or 3 and vice versa. In that case, the empty values should be marked as N/A.

So far, I have tried using pandas to convert the data like so:

df_sensor1 = pd.DataFrame(numpy_arr_sens1)
df_sensor2 = pd.DataFrame(numpy_arr_sens2)
df_sensor3 = pd.DataFrame(numpy_arr_sens3)

and then tried using pandas.DataFrame.merge, but I'm pretty sure that won't work for what I'm trying to do now. Can anyone point me in the right direction?

Can you show what you tried with merge? For instance, it should work if you did merged = pd.merge(df_sensor1, df_sensor2, on='timestamp') and then repeated that for df_sensor3. Or, if you set the index to timestamp on all the dfs, you could just do pd.concat([df_sensor1, df_sensor2, df_sensor3])
– EdChum, Commented Aug 25, 2015 at 22:21
Thank you for the quick answer! I used merge exactly like you wrote, but that apparently does an inner join, so only data points whose timestamps appear in all tables are written to the merged table. I've tried an outer join as well, which does include all the data but doesn't get the ordering right. I did just try concat though: merged = pd.concat([df_sensor1, df_sensor2, df_sensor3], axis=1) followed by merged.to_csv('out.csv', sep=';', header=True, index=True, na_rep='N/A'), and that seems to have done the job. I'll have to verify it tomorrow.
– vind, Commented Aug 25, 2015 at 23:32

Romain · Accepted Answer · 2015-08-27 11:12:50Z

23

I think you can simply:

Define the timestamp as the index of each DataFrame (using set_index)
Join the DataFrames with how='outer'
Optionally convert the timestamps to datetime

Here is what it looks like.

import pandas as pd

# generating some test data
timestamp = [1440540000, 1450540000]
df1 = pd.DataFrame(
    {'timestamp': timestamp, 'a': ['val_a', 'val2_a'], 'b': ['val_b', 'val2_b'], 'c': ['val_c', 'val2_c']})
# building a different index (no timestamp in common with df1)
timestamp = [1440550000, 1450550000]
df2 = pd.DataFrame(
    {'timestamp': timestamp, 'd': ['val_d', 'val2_d'], 'e': ['val_e', 'val2_e'], 'f': ['val_f', 'val2_f'],
     'g': ['val_g', 'val2_g']})
# keeping a value in common with the first index
timestamp = [1440540000, 1450560000]
df3 = pd.DataFrame({'timestamp': timestamp, 'h': ['val_h', 'val2_h'], 'i': ['val_i', 'val2_i']})

# Setting the timestamp as the index
df1.set_index('timestamp', inplace=True)
df2.set_index('timestamp', inplace=True)
df3.set_index('timestamp', inplace=True)

# You can convert timestamps to dates but it's not mandatory I think
df1.index = pd.to_datetime(df1.index, unit='s')
df2.index = pd.to_datetime(df2.index, unit='s')
df3.index = pd.to_datetime(df3.index, unit='s')

# Just perform a join and that's it
result = df1.join(df2, how='outer').join(df3, how='outer')
result
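For the setup described in the question (UNIX millisecond timestamps stored as numpy.uint64, output written to CSV with N/A for missing values), the same approach might look roughly like the sketch below. The sensor values and the column subset are made up for illustration, and the explicit sort_index() is only a safeguard in case the join does not return the rows in timestamp order.

```python
import numpy as np
import pandas as pd

# Hypothetical readings; timestamps are UNIX milliseconds stored as uint64.
df1 = pd.DataFrame({
    "timestamp": np.array([1440540000000, 1440540001000], dtype=np.uint64),
    "val_a": [1.0, 2.0],
})
df2 = pd.DataFrame({
    "timestamp": np.array([1440540000500, 1440540001000], dtype=np.uint64),
    "val_d": [10.0, 20.0],
})
df3 = pd.DataFrame({
    "timestamp": np.array([1440540000000, 1440540002000], dtype=np.uint64),
    "val_h": [100.0, 200.0],
})

# Use the timestamp as the index of every frame.
df1 = df1.set_index("timestamp")
df2 = df2.set_index("timestamp")
df3 = df3.set_index("timestamp")

# Outer joins keep every timestamp; missing readings become NaN.
merged = df1.join(df2, how="outer").join(df3, how="outer").sort_index()

# With no path given, to_csv returns the CSV text; NaN is written as N/A.
csv_text = merged.to_csv(sep=";", na_rep="N/A")
print(csv_text)
```

Passing a file name as the first argument of to_csv writes the same content to disk instead of returning a string.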

edited Aug 27, 2015 at 11:12

answered Aug 26, 2015 at 21:58

3 Comments

n1k31t4 Over a year ago

If you use this solution and have many tables (or an unknown dynamic amount of them), then it is possible to put the join operations within the reduce function, meaning much less code. Also, I believe the pandas.merge() will generalise the join() method used above. First, from functools import reduce, then

result = reduce(lambda left, right: pd.merge(left, right, left_on='timestamp', right_on='timestamp', how='outer'), df_list)

- where the last argument df_list is a list of your DataFrames, e.g. df_list = [df1, df2, df3, ..., df_n].
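A minimal, self-contained sketch of this reduce idea, assuming timestamp is still an ordinary column (not the index) in every DataFrame. The small frames and their values are hypothetical; the final sort is added so that the row order does not depend on how the outer merge orders the keys.

```python
from functools import reduce

import pandas as pd

# Hypothetical frames that share only some timestamps.
df_list = [
    pd.DataFrame({"timestamp": [1, 3], "a": [0.1, 0.3]}),
    pd.DataFrame({"timestamp": [2, 3], "d": [2.0, 3.0]}),
    pd.DataFrame({"timestamp": [1, 4], "h": [10, 40]}),
]

# Fold the list pairwise into one frame with an outer merge on the column.
result = reduce(
    lambda left, right: pd.merge(left, right, on="timestamp", how="outer"),
    df_list,
)

# Order the rows by timestamp and renumber them.
result = result.sort_values("timestamp").reset_index(drop=True)
print(result)
```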

arash javanmard Over a year ago

@n1k31t4 if timestamp is the index rather than a column, left_index and right_index should be set to True instead of left_on and right_on (pandas does not accept both for the same side). So we have pd.merge(left, right, how='outer', left_index=True, right_index=True).

Jérôme Over a year ago

@n1k31t4 according to the docs, join can be given a list: "Efficiently join multiple DataFrame objects by index at once by passing a list."
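As a rough sketch of the list form of join, assuming every DataFrame is already indexed by timestamp and no column name appears in more than one frame. The data is hypothetical, and sort_index() is applied afterwards rather than relying on the order the join returns.

```python
import pandas as pd

# Hypothetical frames indexed by timestamp, with non-overlapping column names.
df1 = pd.DataFrame({"a": [0.1, 0.3]}, index=pd.Index([1, 3], name="timestamp"))
df2 = pd.DataFrame({"d": [2.0, 3.0]}, index=pd.Index([2, 3], name="timestamp"))
df3 = pd.DataFrame({"h": [10.0, 40.0]}, index=pd.Index([1, 4], name="timestamp"))

# One call joins all frames on their index; how="outer" keeps every timestamp.
result = df1.join([df2, df3], how="outer").sort_index()
print(result)
```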
