1

I'm not sure how to solve the following problem in an efficient vectorized fashion. The brute force method I came up with is ugly and slow.

I want to filter a strut based on multiple fields.

I have a flat strut of arrays that looks something like this.

s = 

   id: [1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2]
date1: [1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4]
date2: [1 1 1 1 2 2 2 2 1 1 1 1 2 2 2 2]
value: [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16]

Assume all arrays are of equal length, there may be also be numerous id's.

For each unique date2 value ( 1 and 2 in the above example), I want to return the latest/most recent date1 ( 4 in this example) and the associated id and value. The brute force method is very slow on struts with large struts ( 100k in length) with numerous ids'.

The returning result I'm looking using the above example would be:

s = 

       id: [1 1 2 2]
    date1: [4 4 4 4]
    date2: [1 2 1 2]
    value: [4 8 12 16]

1 Answer 1

2

You can do this by using diff. diff returns the difference between pairs of values in your array. This also computes the discrete approximation to the derivative, such that for an index i in your array x, and x_i accesses the ith element of your array, the output at index i for the output array y is thus:

y_i = x_{i+1} - x_i

Note that diff will return a M - 1 element vector given that your vector is of size M. As such, we will pad the last element with inf so that we can capture the last value of the array.

As soon as diff returns a non-zero difference, this is when a change happens and so this is the location that you need to keep track of that results in the very last time a number in date2 remained the same before transitioning.

As such, this is the algorithm I would do:

  1. Find where a change occurs in date2
  2. Use these indices to access the fields in your structure.
  3. Create a new structure that subsets each of the fields' arrays using the output in Step #1

Without further ado:

%// Set up example data
s.id = [1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2];
s.date1 = [1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4];
s.date2 = [1 1 1 1 2 2 2 2 1 1 1 1 2 2 2 2];
s.value = [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16];

%// Find where the transitions are
ind = find(diff([s.date2 inf]));

%// Subset each of the other arrays in the structure and
%// assign to output
sOut.id = s.id(ind);
sOut.date1 = s.date1(ind);
sOut.date2 = s.date2(ind);
sOut.value = s.value(ind);

sOut will contain the output structure. Given your example, the output is:

sOut = 

   id: [1 1 2 2]
date1: [4 4 4 4]
date2: [1 2 1 2]
value: [4 8 12 16]

Bonus - Time Measurement

I decided to time how long this took using randomly generated integers for id, date1, date2 and value of 100000 elements each using a seed of 123. I'm running MATLAB R2014a (v. 8.3.0.532) on an Intel i7-4770 @ 3.40 GHz with 8 GB of RAM. I used timeit to time how long this took. Here's the code I used:

function [] = measureTime()

format long;
T = timeit(@testing5);
disp(['Time elapsed: ' num2str(T)]);
end

function [] = testing5()
rng(123); %//Set for reproducibility
s.id = randi(100, 1, 100000);
s.date1 = randi(100, 1, 100000);
s.date2 = randi(100, 1, 100000);
s.value = randi(100, 1, 100000);
ind = find(diff([s.date2 inf]));
sOut.id = s.id(ind);
sOut.date1 = s.date1(ind);
sOut.date2 = s.date2(ind);
sOut.value = s.value(ind);
end

All you have to do is copy this code to a file, call it measureTime.m and run this in the MATLAB command prompt. As such, the time elapsed was:

>> measureTime

Time elapsed: 0.0086272

... so roughly 8.6 ms... pretty good!

Sign up to request clarification or add additional context in comments.

2 Comments

wow, if the array if out of order just need to sort it once or twice and its solved, this is very elegant, thanks for teaching me this trick! Would this fall under signal processing?
@pyCthon - Hm! If you sort it first, then your data will look a lot neater and there will be less entries. I assumed that the ids could be in any arbitrary order, but nice on the sorting! As for the tag, not sure if it would fall under signal processing for your question, but my answer uses concepts from signal processing. Tell you what... I'll leave it to you! I'm cool with the tag! You're also very welcome with the trick :) Someone on SO (I can't remember who...) said something that will echo through the ages: "MATLAB is powerful... it can even make you breakfast if you know how."

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.