Efficient way for calculating selected differences in array

Question

I have two arrays as an output from a simulation script where one contains IDs and one times, i.e. something like:

ids = np.array([2, 0, 1, 0, 1, 1, 2])
times = np.array([.1, .3, .3, .5, .6, 1.2, 1.3])

These arrays are always of the same size. Now I need to calculate the differences of times, but only for those times with the same ids. Of course, I can simply loop over the different ids and do

for id in np.unique(ids):
    diffs = np.diff(times[ids == id])
    print(diffs)
    # do stuff with diffs

However, this is quite inefficient and the two arrays can be very large. Does anyone have a good idea on how to do that more efficiently?

P. Camilleri · Accepted Answer · 2016-10-06 10:58:39Z

3

You can use ids.argsort() to sort by id, take the differences of the reordered times, and then ignore the values that straddle a change in id:

>>> id_ind = ids.argsort(kind='mergesort')
>>> times_diffs = np.diff(times[id_ind])
>>> times_diffs
array([ 0.2, -0.2,  0.3,  0.6, -1.1,  1.2])

To see which values you need to discard, you could use a Counter to count the number of times per id (from collections import Counter),

or just sort ids and see where its diff is nonzero: these are the indices where the id changes and where your time diffs are irrelevant:

times_diffs[np.diff(ids[id_ind]) == 0]  # ids[id_ind] being the sorted sequence of ids

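Finally, you can split the retained differences into one array per id with np.split and np.where. The split points have to be shifted to account for the discarded values:

bounds = np.where(np.diff(ids[id_ind]) != 0)[0]
np.split(times_diffs[np.diff(ids[id_ind]) == 0], bounds - np.arange(len(bounds)))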

As you mentioned in your comment, argsort()'s default algorithm (quicksort) is not stable, so it might not preserve the order of times that share the same id; the argsort(kind='mergesort') option must therefore be used.
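Putting the steps above together, here is a minimal sketch of the whole approach as one function. The name grouped_diffs and its return layout are illustrative choices, not part of the original answer, and the sketch assumes non-empty input arrays. It splits the sorted times at the id boundaries first and takes the differences within each group afterwards, so no cross-boundary values need to be discarded.

```python
import numpy as np

def grouped_diffs(ids, times):
    """Hypothetical helper: return (unique_ids, list of per-id np.diff arrays)."""
    # Stable sort keeps the original order of times within each id
    order = np.argsort(ids, kind='mergesort')
    sorted_ids = ids[order]
    sorted_times = times[order]
    # Positions in the sorted arrays where a new id starts
    starts = np.flatnonzero(np.diff(sorted_ids) != 0) + 1
    groups = np.split(sorted_times, starts)
    # The first element of each group carries that group's id
    unique_ids = sorted_ids[np.concatenate(([0], starts))]
    return unique_ids, [np.diff(g) for g in groups]

ids = np.array([2, 0, 1, 0, 1, 1, 2])
times = np.array([.1, .3, .3, .5, .6, 1.2, 1.3])
unique_ids, diffs = grouped_diffs(ids, times)
for uid, d in zip(unique_ids, diffs):
    print(uid, d)
```

The Python-level loop here runs only over the groups returned by np.split, not over boolean masks of the full arrays as in the question.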

edited Oct 6, 2016 at 10:58

answered Oct 5, 2016 at 12:01

P. Camilleri


4 Comments

obachtos Over a year ago

Is there a reason for using sorted(ids) when you already have the indices for sorting the array, i.e. ids[id_ind]?

P. Camilleri Over a year ago

@obachtos Nope it was just laziness. Fixing it

obachtos Over a year ago

One more remark: argsort() with its standard algorithm quicksort might mess up the order of times. It's better to use the stable mergesort, i.e. argsort(kind='mergesort').

P. Camilleri Over a year ago

@obachtos Nice remark. In the future put it as a comment for me to edit my answer: if you try and edit it yourself, reviewers will reject it because "This edit deviates from the original intent of the post. Even edits that must make drastic changes should strive to preserve the goals of the post's owner."

obachtos · Accepted Answer · 2016-10-06 09:25:15Z

2

Say you np.argsort by ids:

>>> inds = np.argsort(ids, kind='mergesort')
>>> inds
array([1, 3, 2, 4, 5, 0, 6])

Now reorder times by these indices, apply np.diff, and prepend a nan:

>>> diffs = np.concatenate(([np.nan], np.diff(times[inds])))
>>> diffs
array([ nan,  0.2, -0.2,  0.3,  0.6, -1.1,  1.2])

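These differences are correct except at the boundaries between ids. Let's compute a mask that is True wherever an element has the same id as its predecessor (despite its name, boundaries is False exactly at the boundaries):

>>> boundaries = np.concatenate(([False], ids[inds][1:] == ids[inds][:-1]))
>>> boundaries
array([False,  True, False,  True,  True, False,  True], dtype=bool)

Now we can overwrite the boundary entries with nan:

>>> diffs[~boundaries] = np.nan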

Let's see what we got:

>>> ids[inds]
array([0, 0, 1, 1, 1, 2, 2])

>>> times[inds]
array([ 0.3,  0.5,  0.3,  0.6,  1.2,  0.1,  1.3])

>>> diffs
array([ nan,  0.2,  nan,  0.3,  0.6,  nan,  1.2])
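If the differences are needed in the original element order rather than in id-sorted order, the masked result can be scattered back through the sort indices. A minimal sketch under that assumption follows; the function name diffs_in_original_order is made up for illustration, and the input arrays are assumed to be non-empty.

```python
import numpy as np

def diffs_in_original_order(ids, times):
    # Hypothetical helper: NaN marks the first occurrence of each id
    inds = np.argsort(ids, kind='mergesort')  # stable sort by id
    sorted_ids = ids[inds]
    diffs = np.concatenate(([np.nan], np.diff(times[inds])))
    # True where an element has the same id as its predecessor
    same_as_prev = np.concatenate(([False], sorted_ids[1:] == sorted_ids[:-1]))
    diffs[~same_as_prev] = np.nan
    # Undo the sort: position inds[k] of the input receives diffs[k]
    out = np.empty_like(diffs)
    out[inds] = diffs
    return out

ids = np.array([2, 0, 1, 0, 1, 1, 2])
times = np.array([.1, .3, .3, .5, .6, 1.2, 1.3])
result = diffs_in_original_order(ids, times)
print(result)
```

Each entry of result is the time elapsed since the previous occurrence of the same id, or nan if there is none.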

edited Oct 6, 2016 at 9:25

obachtos


answered Oct 5, 2016 at 12:02

Ami Tavory


Ami Tavory · Accepted Answer · 2016-10-05 12:37:11Z

1

I'm adding another answer, since, even though these things are possible in numpy, I think that the higher-level pandas is much more natural for them.

In pandas, you could do this in one step, after creating a DataFrame:

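import pandas as pd

df = pd.DataFrame({'ids': ids, 'times': times})

# Select the 'times' column explicitly so that the result is a single Series
df['diffs'] = df.groupby('ids')['times'].transform(pd.Series.diff)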

This gives:

>>> df
   ids  times  diffs
0    2    0.1    NaN
1    0    0.3    NaN
2    1    0.3    NaN
3    0    0.5    0.2
4    1    0.6    0.3
5    1    1.2    0.6
6    2    1.3    1.2
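For this particular operation, pandas also exposes diff directly on grouped objects, so the transform call can be shortened. A small sketch, assuming a reasonably recent pandas version; the explicit ['times'] column selection is added here to keep the result a single Series:

```python
import numpy as np
import pandas as pd

ids = np.array([2, 0, 1, 0, 1, 1, 2])
times = np.array([.1, .3, .3, .5, .6, 1.2, 1.3])
df = pd.DataFrame({'ids': ids, 'times': times})

# Per-group first differences, aligned with the original row order
df['diffs'] = df.groupby('ids')['times'].diff()
print(df)
```

The first row of each id has no predecessor within its group, so its diff is NaN.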

answered Oct 5, 2016 at 12:37

Ami Tavory


1 Comment

Shane S Over a year ago

This is a good answer. Let me add that when your DataFrame has multiple columns, it is a good idea to select the column you need explicitly: df['diffs'] = df.groupby(['ids'])['times'].transform(pd.Series.diff)

Eelco Hoogendoorn · Accepted Answer · 2016-10-07 07:44:53Z

1

The numpy_indexed package (disclaimer: I am its author) contains efficient and flexible functionality for this kind of grouping operation:

import numpy_indexed as npi
unique_ids, diffed_time_groups = npi.group_by(keys=ids, values=times, reduction=np.diff)

Unlike pandas, it does not require a specialized data structure just to perform this kind of rather elementary operation.

edited Oct 7, 2016 at 7:44

answered Oct 5, 2016 at 13:06

Eelco Hoogendoorn


3 Comments

Ami Tavory Over a year ago

In general, when someone promotes his/her own library, it's customary to add a disclaimer that he/she is the author.

Eelco Hoogendoorn Over a year ago

Ah yes; I am in the habit of doing so, but I forgot; thanks.

Ami Tavory Over a year ago

Good luck with your package.
