
I have time series data from three completely different sensor sources as CSV files and want to combine them into one big CSV file. I've managed to read them into numpy using numpy's genfromtxt, but I'm not sure what to do from here.

Basically, what I have is something like this:

Table 1:

timestamp    val_a   val_b   val_c

Table 2:

timestamp    val_d   val_e   val_f   val_g

Table 3:

timestamp    val_h   val_i

All timestamps are UNIX millisecond timestamps as numpy.uint64.

And what I want is:

timestamp    val_a   val_b   val_c   val_d   val_e   val_f   val_g   val_h   val_i

...where all data is combined and ordered by timestamps. Each of the three tables is already ordered by timestamp. Since the data comes from different sources, there is no guarantee that a timestamp from table 1 will also be in table 2 or 3 and vice versa. In that case, the empty values should be marked as N/A.

So far, I have tried using pandas to convert the data like so:

df_sensor1 = pd.DataFrame(numpy_arr_sens1)
df_sensor2 = pd.DataFrame(numpy_arr_sens2)
df_sensor3 = pd.DataFrame(numpy_arr_sens3)

and then tried using pandas.DataFrame.merge, but I'm pretty sure that won't work for what I'm trying to do now. Can anyone point me in the right direction?

  • Can you show what you tried with merge? For instance, it should work if you did merged = pd.merge(df_sensor1, df_sensor2, on='timestamp') and then repeated that for df_sensor3. Alternatively, if you set the index to timestamp on all the dfs, you could just do pd.concat([df_sensor1, df_sensor2, df_sensor3], axis=1). Commented Aug 25, 2015 at 22:21
  • Thank you for the quick answer! I used merge exactly like you wrote, but that apparently does an inner join, so only data points that have timestamps in all tables are written to the merged table. I've tried an outer join as well, which does include all the data but also doesn't get the ordering right. I did just try concat though. I did merged = pd.concat([df_sensor1, df_sensor2, df_sensor3], axis=1) and merged.to_csv('out.csv', sep=';', header=True, index=True, na_rep='N/A') and that seems to have done the job. I'll have to verify it tomorrow. Commented Aug 25, 2015 at 23:32
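
The concat approach from the comments above can be sketched roughly as follows. This is only an illustration: the sample values are made up, and it assumes each frame already uses the timestamp as its index.

```python
import pandas as pd

# Hypothetical sample data; the index holds UNIX millisecond timestamps.
df_sensor1 = pd.DataFrame({'val_a': [1.0, 2.0]},
                          index=pd.Index([1000, 3000], name='timestamp'))
df_sensor2 = pd.DataFrame({'val_d': [10.0, 20.0]},
                          index=pd.Index([2000, 3000], name='timestamp'))
df_sensor3 = pd.DataFrame({'val_h': [100.0]},
                          index=pd.Index([1000], name='timestamp'))

# axis=1 aligns rows on the index and places the columns side by side.
# The default join='outer' keeps every timestamp and fills gaps with NaN.
# sort_index() makes the timestamp ordering explicit.
merged = pd.concat([df_sensor1, df_sensor2, df_sensor3], axis=1).sort_index()

# With no path given, to_csv returns the CSV as a string.
# na_rep controls how missing values are written.
csv_text = merged.to_csv(sep=';', header=True, index=True, na_rep='N/A')
print(csv_text)
```

Passing a file name such as 'out.csv' as the first argument to to_csv writes the same content to disk instead of returning it.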

1 Answer


I think that you can simply:

  • Define the timestamp as the index of each DataFrame (using set_index)
  • Use a join with the 'outer' method to merge them
  • Optionally convert the timestamp to datetime

Here is what it looks like.

import pandas as pd

# generating some test data (timestamps in seconds)
timestamp = [1440540000, 1450540000]
df1 = pd.DataFrame(
    {'timestamp': timestamp, 'a': ['val_a', 'val2_a'], 'b': ['val_b', 'val2_b'], 'c': ['val_c', 'val2_c']})
# building a different index: timestamps that do not appear in df1
timestamp = [1440550000, 1450550000]
df2 = pd.DataFrame(
    {'timestamp': timestamp, 'd': ['val_d', 'val2_d'], 'e': ['val_e', 'val2_e'], 'f': ['val_f', 'val2_f'],
     'g': ['val_g', 'val2_g']})
# keeping a value in common with the first index
timestamp = [1440540000, 1450560000]
df3 = pd.DataFrame({'timestamp': timestamp, 'h': ['val_h', 'val2_h'], 'i': ['val_i', 'val2_i']})

# Setting the timestamp as the index
df1.set_index('timestamp', inplace=True)
df2.set_index('timestamp', inplace=True)
df3.set_index('timestamp', inplace=True)

# You can convert timestamps to dates, but it is optional (use unit='ms' for millisecond timestamps)
df1.index = pd.to_datetime(df1.index, unit='s')
df2.index = pd.to_datetime(df2.index, unit='s')
df3.index = pd.to_datetime(df3.index, unit='s')

# Just perform a join and that's it
result = df1.join(df2, how='outer').join(df3, how='outer')
result
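
For the millisecond uint64 timestamps described in the question, the same idea might look like the sketch below. The readings and the name result_ms are invented for illustration, and the cast to int64 is a precaution rather than a documented requirement.

```python
import numpy as np
import pandas as pd

# Hypothetical readings; timestamps are UNIX milliseconds stored as uint64.
df1 = pd.DataFrame({
    'timestamp': np.array([1440540000000, 1440540002000], dtype=np.uint64),
    'val_a': [0.1, 0.2],
})
df2 = pd.DataFrame({
    'timestamp': np.array([1440540001000, 1440540002000], dtype=np.uint64),
    'val_d': [1.5, 2.5],
})

left = df1.set_index('timestamp')
right = df2.set_index('timestamp')

# The outer join keeps timestamps from both frames.
# sort_index() makes the chronological ordering explicit.
result_ms = left.join(right, how='outer').sort_index()

# Cast to int64 as a precaution, then convert with unit='ms'
# because these timestamps are milliseconds, not seconds.
result_ms.index = pd.to_datetime(result_ms.index.astype('int64'), unit='ms')

# Missing readings are written as N/A in the CSV output.
csv_text = result_ms.to_csv(na_rep='N/A')
```

Rows that exist in only one source keep their values and show N/A in the other source's columns when written out.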


3 Comments

If you use this solution and have many tables (or an unknown dynamic amount of them), then it is possible to put the join operations within the reduce function, meaning much less code. Also, I believe the pandas.merge() will generalise the join() method used above. First, from functools import reduce, then result = reduce(lambda left, right: pd.merge(left, right, left_on='timestamp', right_on='timestamp', how='outer'), df_list) - where the last argument df_list is a list of your DataFrames, e.g. df_list = [df1, df2, df3, ..., df_n].
@n1k31t4 If timestamp has already been set as the index (as in the answer), use left_index=True and right_index=True in place of left_on and right_on, since pandas does not accept both for the same side. So we have pd.merge(left, right, how='outer', left_index=True, right_index=True).
@n1k31t4 according to the docs, join can be given a list: "Efficiently join multiple DataFrame objects by index at once by passing a list."
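
The two variants mentioned in these comments could be sketched as follows. The frames in df_list are made-up examples, and the sort calls are added only to make the ordering explicit.

```python
from functools import reduce

import pandas as pd

# Hypothetical frames that each still carry 'timestamp' as a regular column.
df_list = [
    pd.DataFrame({'timestamp': [1, 3], 'a': ['a1', 'a3']}),
    pd.DataFrame({'timestamp': [2, 3], 'd': ['d2', 'd3']}),
    pd.DataFrame({'timestamp': [1, 4], 'h': ['h1', 'h4']}),
]

# Variant 1: fold pd.merge over the list, matching on the 'timestamp' column.
merged = reduce(
    lambda left, right: pd.merge(left, right, on='timestamp', how='outer'),
    df_list,
).sort_values('timestamp').reset_index(drop=True)

# Variant 2: index every frame by timestamp, then pass the remaining
# frames to join as a list (joining a list works on the index only).
indexed = [df.set_index('timestamp') for df in df_list]
joined = indexed[0].join(indexed[1:], how='outer').sort_index()
```

Both variants should end up with one row per distinct timestamp. The first keeps timestamp as a column, while the second keeps it as the index.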
