
I have an array of shape (128, 36, 8) and I'd like to find the number of occurrences of the unique subarrays of length 8 in the last dimension.

I'm aware of np.unique and np.bincount, but those seem to be for elements rather than subarrays. I've seen this question but it's about finding the first occurrence of a particular subarray, rather than the counts of all unique subarrays.

  • I couldn't think up a way to do it within numpy, but would a trie be too slow? It would only need to access each element once and then at the end you automatically have the number of unique subarrays as well as their locations if you stored them. Commented Jun 16, 2015 at 23:51
  • Here is a closely related question, stackoverflow.com/questions/8560440/…. The basic idea is you sort the subarrays (lexicographic sort). Once the similar subarrays are grouped it's trivial to identify and count them. Commented Jun 17, 2015 at 0:07

3 Answers


The question states that the input array is of shape (128, 36, 8) and we are interested in finding unique subarrays of length 8 in the last dimension. So, I am assuming that uniqueness is to be judged with the first two dimensions merged together. Let A be the input 3D array.

Get the number of unique subarrays

# Reshape the 3D array to a 2D array, merging the first two dimensions
Ar = A.reshape(-1, A.shape[2])

# Perform a lexicographic sort so that identical rows end up adjacent
sorted_idx = np.lexsort(Ar.T)
sorted_Ar = Ar[sorted_idx, :]

# Count the adjacent row-pairs that differ in at least one column;
# each such boundary starts a new unique subarray (+1 for the first row)
unq_out = np.any(np.diff(sorted_Ar, axis=0), 1).sum() + 1

Sample run -

In [159]: A # A is (2,2,3)
Out[159]: 
array([[[0, 0, 0],
        [0, 0, 2]],

       [[0, 0, 2],
        [2, 0, 1]]])

In [160]: unq_out
Out[160]: 3
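
As a quick cross-check of that sample run (using the same toy A and, purely for illustration, collections.Counter):

import numpy as np
from collections import Counter

A = np.array([[[0, 0, 0],
               [0, 0, 2]],
              [[0, 0, 2],
               [2, 0, 1]]])

# Three distinct rows: (0, 0, 0), (0, 0, 2), (2, 0, 1)
print(len(Counter(tuple(r) for r in A.reshape(-1, 3))))  # -> 3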

Get the count of occurrences of unique subarrays

# Reshape the 3D array to a 2D array, merging the first two dimensions
Ar = A.reshape(-1, A.shape[2])

# Perform a lexicographic sort so that identical rows end up adjacent
sorted_idx = np.lexsort(Ar.T)
sorted_Ar = Ar[sorted_idx, :]

# Assign a group ID to each row: the ID increments at every boundary
# between distinct rows
ids = np.append([0], np.any(np.diff(sorted_Ar, axis=0), 1).cumsum())

# Get counts for each ID as the final output
unq_count = np.bincount(ids)

Sample run -

In [64]: A
Out[64]: 
array([[[0, 0, 2],
        [1, 1, 1]],

       [[1, 1, 1],
        [1, 2, 0]]])

In [65]: unq_count
Out[65]: array([1, 2, 1], dtype=int64)
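
As an aside, if your NumPy is new enough (np.unique grew an axis argument in 1.13), both results above come out of a single call; a minimal sketch on the same sample A:

import numpy as np

A = np.array([[[0, 0, 2],
               [1, 1, 1]],
              [[1, 1, 1],
               [1, 2, 0]]])

# Unique rows and how often each one occurs, in one call
unq_rows, unq_count = np.unique(A.reshape(-1, A.shape[2]), axis=0,
                                return_counts=True)

print(len(unq_rows))  # -> 3, the number of unique subarrays
print(unq_count)      # -> [1 2 1], occurrences of each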

4 Comments

This is great (I hadn't thought to use np.lexsort and I didn't know about np.diff), but we're actually interested in finding the number of occurrences of unique subarrays, not just the number of unique subarrays. Could this method be adapted to return the unique subarrays along with their counts, as in @farhawa's answer?
Awesome, thank you. BTW, my modification of your original answer seems to be slightly faster than your extension: ~668 µs vs ~685 µs.
@Will Great! How about testing it on a larger dataset, something like (1000, 1000, 8), if possible?
I got similar results with a larger dataset: 423 ms vs 433 ms.

Here I've modified @Divakar's very useful answer to return the counts of the unique subarrays, as well as the subarrays themselves, so that the output is the same as that of collections.Counter.most_common() (wrapped as a function here, since the snippet ends with a return):

import numpy as np

def unique_subarray_counts(arr):
    # Get the array in 2D form, merging all leading dimensions
    arr = arr.reshape(-1, arr.shape[-1])

    # Lexicographically sort so that identical rows become adjacent
    sorted_arr = arr[np.lexsort(arr.T), :]

    # Get the indices where a new row appears
    diff_idx = np.where(np.any(np.diff(sorted_arr, axis=0), 1))[0]

    # Get the unique rows
    unique_rows = [sorted_arr[i] for i in diff_idx] + [sorted_arr[-1]]

    # Get the number of occurrences of each unique row (the -1 is needed
    # at the beginning, rather than 0, because of fencepost concerns)
    counts = np.diff(
        np.append(np.insert(diff_idx, 0, -1), sorted_arr.shape[0] - 1))

    # Return the (row, count) pairs sorted by count
    return sorted(zip(unique_rows, counts), key=lambda x: x[1], reverse=True)
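
As a quick sanity check (the array contents and the function name above are illustrative), the counts agree with collections.Counter:

from collections import Counter

import numpy as np

np.random.seed(0)
A = np.random.randint(0, 2, (128, 36, 8))

pairs = unique_subarray_counts(A)
expected = Counter(tuple(row) for row in A.reshape(-1, 8)).most_common()

# Ties may be ordered differently, so compare the count sequences only
assert [c for _, c in pairs] == [c for _, c in expected]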



I am not sure that it's the most efficient way to do it, but this should work.

arr = arr.reshape(128 * 36, 8)  # flatten the first two dimensions
unique_ = []
occurrence_ = []

# Linear scan: record each new subarray, bump the count of known ones
for sub in arr:
    if sub.tolist() not in unique_:
        unique_.append(sub.tolist())
        occurrence_.append(1)
    else:
        occurrence_[unique_.index(sub.tolist())] += 1

for index_, u in enumerate(unique_):
    print(u, "occurrence: %s" % occurrence_[index_])

4 Comments

This will work, but I was looking to avoid native-Python operations such as tolist and index, which are expensive. Thanks for the answer though.
An obvious optimization to your method, by the way, would be to keep the counts in a dictionary where the keys are tuples of the subarrays, rather than in a list that we need to keep searching through with unique_.index.
@Will or even better, use collections.Counter: counts = Counter(tuple(row) for row in arr) :) (see the sketch after these comments)
@BiRico, that's great, didn't know about that builtin!
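
Following up on the last two comments, a minimal sketch of that one-pass counting (the array here is arbitrary example data):

from collections import Counter

import numpy as np

A = np.random.randint(0, 3, (128, 36, 8))

# Hash each subarray as a tuple; Counter does the bookkeeping in one pass
counts = Counter(tuple(row) for row in A.reshape(-1, A.shape[-1]))

# The most common subarray and its number of occurrences
print(counts.most_common(1))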
