I am working on some data processing in a Rails app and I am trying to deal with a performance pain point. I have two arrays, x_data and y_data, that each look as follows (with different values, of course):
[
{ 'timestamp_value' => '2017-01-01 12:00', 'value' => '432' },
{ 'timestamp_value' => '2017-01-01 12:01', 'value' => '421' },
...
]
Each array has up to perhaps 25k items. I need to prepare this data for further x-y regression analysis.
Now, some values in x_data or y_data can be nil. I need to remove the entry from both arrays if either x_data or y_data has a nil value at that timestamp. I then need to return only the values (without the timestamps) for both arrays.
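To make the requirement concrete, here is a small made-up example. The values and the names x_sample, y_sample and expected are purely illustrative and are not taken from my real data:

```ruby
# Hypothetical sample input; the real arrays hold up to ~25k rows each.
x_sample = [
  { 'timestamp_value' => '2017-01-01 12:00', 'value' => '432' },
  { 'timestamp_value' => '2017-01-01 12:01', 'value' => nil },
  { 'timestamp_value' => '2017-01-01 12:02', 'value' => '410' }
]
y_sample = [
  { 'timestamp_value' => '2017-01-01 12:00', 'value' => '12' },
  { 'timestamp_value' => '2017-01-01 12:01', 'value' => '15' },
  { 'timestamp_value' => '2017-01-01 12:02', 'value' => nil }
]

# 12:01 is dropped because x is nil there, 12:02 because y is nil there.
# Only 12:00 survives, so the desired result is [x_values, y_values]:
expected = [['432'], ['12']]
```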
In my current approach, I am first extracting the timestamps from both arrays where the values are not nil, then performing a set intersection on the timestamps to produce a final timestamps array. I then select values using that final array of timestamps. Here's the code:
def values_for_regression(x_data, y_data)
  x_timestamps = timestamps_for(x_data)
  y_timestamps = timestamps_for(y_data)
  # Get final timestamps as the intersection of the two
  timestamps = x_timestamps.intersection(y_timestamps)
  x_values = values_for(x_data, timestamps)
  y_values = values_for(y_data, timestamps)
  [x_values, y_values]
end
def timestamps_for(data)
  Set.new data.reject { |row| row['value'].nil? }.
    map { |row| row['timestamp_value'] }
end
def values_for(data, timestamps)
  data.select { |row| timestamps.include?(row['timestamp_value']) }.
    map { |row| row['value'] }
end
This approach isn't terribly performant, and I need to do this on several sets of data in quick succession. The overhead of the multiple loops adds up. There must be a way to at least reduce the number of loops necessary.
Any ideas or suggestions will be appreciated.
A structure like { '2017-01-01 12:00' => [123, 456] } would be much easier to work with.
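A rough sketch of building that kind of structure in one pass over each array might look like the following. It assumes both arrays use identical timestamp strings and that each timestamp appears at most once per array (with duplicates, the last row wins). Values are kept exactly as they arrive, so they stay strings unless converted elsewhere. The method names pairs_by_timestamp and values_from_pairs are made up for illustration.

```ruby
# Build { timestamp => [x_value, y_value] }, keeping only timestamps where
# both sides have a non-nil value. Hypothetical helper, not existing code.
def pairs_by_timestamp(x_data, y_data)
  # One pass over x_data: index the non-nil x values by timestamp
  x_by_ts = {}
  x_data.each do |row|
    value = row['value']
    x_by_ts[row['timestamp_value']] = value unless value.nil?
  end

  # One pass over y_data: keep rows whose timestamp also has an x value
  pairs = {}
  y_data.each do |row|
    ts = row['timestamp_value']
    y_value = row['value']
    next if y_value.nil? || !x_by_ts.key?(ts)
    pairs[ts] = [x_by_ts[ts], y_value]
  end
  pairs
end

# Split the pairs back into [x_values, y_values] for the regression
def values_from_pairs(pairs)
  x_values, y_values = pairs.values.transpose
  # transpose of an empty array yields [], so fall back to empty arrays
  [x_values || [], y_values || []]
end
```

Because each x value is stored next to its y value, the two output arrays are guaranteed to line up index by index, in the order the timestamps appear in y_data.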