3

I have two arrays as an output from a simulation script where one contains IDs and one times, i.e. something like:

ids = np.array([2, 0, 1, 0, 1, 1, 2])
times = np.array([.1, .3, .3, .5, .6, 1.2, 1.3])

These arrays are always of the same size. Now I need to calculate the differences between times, but only between times that share the same id. Of course, I can simply loop over the different ids and do

for uid in np.unique(ids):  # uid instead of id, to avoid shadowing the builtin
    diffs = np.diff(times[ids == uid])
    print(diffs)
    # do stuff with diffs

However, this is quite inefficient and the two arrays can be very large. Does anyone have a good idea on how to do that more efficiently?

4 Answers

3

You can use array.argsort() and ignore the values that correspond to a change in ids:

>>> id_ind = ids.argsort(kind='mergesort')
>>> times_diffs = np.diff(times[id_ind])
>>> times_diffs
array([ 0.2, -0.2,  0.3,  0.6, -1.1,  1.2])

To see which values you need to discard, you could use a Counter to count the number of times per id (from collections import Counter),

or just sort ids and see where the diff is nonzero: these are the indices where the id changes, and where your time diffs are irrelevant:

times_diffs[np.diff(ids[id_ind]) == 0] # ids[id_ind] being the sorted ids sequence

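and finally you can split this array with np.split and np.where. Splitting the unfiltered times_diffs leaves each cross-id difference as the first element of the following group, so drop it from every group but the first:

groups = np.split(times_diffs, np.where(np.diff(ids[id_ind]) != 0)[0])
groups = [groups[0]] + [g[1:] for g in groups[1:]]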

As you mentioned in your comment, the default argsort() algorithm (quicksort) is not stable, so it might not preserve the original order of times that share the same id. A stable sort, i.e. argsort(kind='mergesort'), must be used.
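
Putting the steps above together, a minimal self-contained sketch might look like the following. The helper name `grouped_time_diffs` and the dictionary return type are illustrative assumptions, not part of the answer. It also assumes non-empty input and integer ids. Instead of filtering cross-id differences afterwards, it splits the sorted times at the start of each group and diffs each chunk, so the remaining Python loop runs once per id rather than building a boolean mask over the whole array for every id.

```python
import numpy as np

def grouped_time_diffs(ids, times):
    """Hypothetical helper: map each id to the diffs of its times.

    Assumes non-empty arrays of equal length and integer ids.
    """
    ids = np.asarray(ids)
    times = np.asarray(times)
    # Stable sort, so times sharing an id keep their original order.
    order = np.argsort(ids, kind='mergesort')
    sorted_ids = ids[order]
    sorted_times = times[order]
    # Positions in the sorted arrays where a new id starts.
    starts = np.where(np.diff(sorted_ids) != 0)[0] + 1
    chunks = np.split(sorted_times, starts)
    # The id belonging to each chunk is the id at the chunk's first position.
    unique_ids = sorted_ids[np.concatenate(([0], starts))]
    return {int(k): np.diff(chunk) for k, chunk in zip(unique_ids, chunks)}

ids = np.array([2, 0, 1, 0, 1, 1, 2])
times = np.array([.1, .3, .3, .5, .6, 1.2, 1.3])
result = grouped_time_diffs(ids, times)
for key, diffs in result.items():
    print(key, diffs)
```

An id that occurs only once simply maps to an empty array, since there is nothing to difference.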

4 Comments

Is there a reason for using sorted(ids) when you already have the indices for sorting the array, i.e. ids[id_ind]?
@obachtos Nope it was just laziness. Fixing it
One more remark: argsort() with its standard algorithm quicksort might mess up the order of times. It's better to use the stable mergesort, i.e. argsort(kind='mergesort').
@obachtos Nice remark. In the future put it as a comment for me to edit my answer: if you try and edit it yourself, reviewers will reject it because "This edit deviates from the original intent of the post. Even edits that must make drastic changes should strive to preserve the goals of the post's owner."
2

Say you np.argsort by ids:

>>> inds = np.argsort(ids, kind='mergesort')
>>> inds
array([1, 3, 2, 4, 5, 0, 6])

Now sort times by this, np.diff, and prepend a nan:

diffs = np.concatenate(([np.nan], np.diff(times[inds])))
>>> diffs 
array([ nan,  0.2, -0.2,  0.3,  0.6, -1.1,  1.2])

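These differences are correct except at the positions where the id changes. Let's find the positions whose id equals that of their predecessor (despite its name, boundaries is True inside a group and False at the first element of each group):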

boundaries = np.concatenate(([False], ids[inds][1: ] == ids[inds][: -1]))
>>> boundaries
array([False,  True, False,  True,  True, False,  True], dtype=bool)

Now we can just do

diffs[~boundaries] = np.nan

Let's see what we got:

>>> ids[inds]
array([0, 0, 1, 1, 1, 2, 2])

>>> times[inds]
array([ 0.3,  0.5,  0.3,  0.6,  1.2,  0.1,  1.3])

>>> diffs
array([ nan,  0.2,  nan,  0.3,  0.6,  nan,  1.2])
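
If the differences are needed in the original element order rather than in id-sorted order, they can be scattered back through the same index array. The following is a minimal sketch under that assumption. The function name `diffs_in_original_order` is made up for illustration, and the input is assumed to be non-empty.

```python
import numpy as np

def diffs_in_original_order(ids, times):
    """Hypothetical helper: per-id time differences, aligned with the input.

    The first occurrence of each id gets NaN. Assumes non-empty input.
    """
    ids = np.asarray(ids)
    times = np.asarray(times, dtype=float)
    # Stable sort keeps the original order of times within each id.
    inds = np.argsort(ids, kind='mergesort')
    sorted_ids = ids[inds]
    # Differences between neighbours in sorted order, NaN for the first slot.
    diffs = np.concatenate(([np.nan], np.diff(times[inds])))
    # True where an element has the same id as its predecessor.
    same_id = np.concatenate(([False], sorted_ids[1:] == sorted_ids[:-1]))
    diffs[~same_id] = np.nan
    # Undo the sort: element k of the sorted result belongs at inds[k].
    out = np.empty_like(diffs)
    out[inds] = diffs
    return out

ids = np.array([2, 0, 1, 0, 1, 1, 2])
times = np.array([.1, .3, .3, .5, .6, 1.2, 1.3])
aligned = diffs_in_original_order(ids, times)
print(aligned)
```

Each entry of `aligned` then sits at the same index as the time it was computed for, which makes it easy to store next to `ids` and `times`.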

1

I'm adding another answer, since, even though these things are possible in numpy, I think that the higher-level pandas is much more natural for them.

In pandas, you could do this in one step, after creating a DataFrame:

df = pd.DataFrame({'ids': ids, 'times': times})

df['diffs'] = df.groupby(df.ids).transform(pd.Series.diff)

This gives:

>>> df
   ids  times  diffs
0    2    0.1    NaN
1    0    0.3    NaN
2    1    0.3    NaN
3    0    0.5    0.2
4    1    0.6    0.3
5    1    1.2    0.6
6    2    1.3    1.2
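
As a variation, the grouped column also has a diff method of its own, so the transform call can be skipped. This is a minimal sketch that assumes a reasonably recent pandas. Selecting the 'times' column explicitly is a choice made here so that the example keeps working when the frame has further columns.

```python
import numpy as np
import pandas as pd

ids = np.array([2, 0, 1, 0, 1, 1, 2])
times = np.array([.1, .3, .3, .5, .6, 1.2, 1.3])

df = pd.DataFrame({'ids': ids, 'times': times})
# Difference of each time to the previous time with the same id.
# The first occurrence of every id has no predecessor and becomes NaN.
df['diffs'] = df.groupby('ids')['times'].diff()
print(df)
```

The result keeps the original row order, so the new column lines up with `ids` and `times`.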

1 Comment

This is a good answer. Let me add that a DataFrame often has multiple columns, in which case it is a good idea to select the column explicitly: df['diffs'] = df.groupby(['ids'])['times'].transform(pd.Series.diff)
1

The numpy_indexed package (disclaimer: I am its author) contains efficient and flexible functionality for this kind of grouping operation:

import numpy_indexed as npi
unique_ids, diffed_time_groups = npi.group_by(keys=ids, values=times, reduction=np.diff)

Unlike pandas, it does not require a specialized data structure just to perform this kind of rather elementary operation.

3 Comments

In general, when someone promotes his/her own library, it's customary to add a disclaimer that he/she is the author.
Ah yes; I am in the habit of doing so, but I forgot; thanks.
Good luck with your package.
