
I have an array of shape (128, 36, 8) and I'd like to find the number of occurrences of the unique subarrays of length 8 in the last dimension.

I'm aware of np.unique and np.bincount, but those seem to be for elements rather than subarrays. I've seen this question but it's about finding the first occurrence of a particular subarray, rather than the counts of all unique subarrays.

  • I couldn't think up a way to do it within numpy, but would a trie be too slow? It would only need to access each element once and then at the end you automatically have the number of unique subarrays as well as their locations if you stored them. Commented Jun 16, 2015 at 23:51
  • Here is a closely related question, stackoverflow.com/questions/8560440/…. The basic idea is you sort the subarrays (lexicographic sort). Once the similar subarrays are grouped it's trivial to identify and count them. Commented Jun 17, 2015 at 0:07

3 Answers


The question states that the input array is of shape (128, 36, 8) and we are interested in finding unique subarrays of length 8 in the last dimension. So, I am assuming that uniqueness is to be judged with the first two dimensions merged together. Let A be the input 3D array.

Get the number of unique subarrays

# Reshape the 3D array to a 2D array, merging the first two dimensions
Ar = A.reshape(-1, A.shape[2])

# Perform a lexicographic sort so that identical rows end up adjacent
sorted_idx = np.lexsort(Ar.T)
sorted_Ar = Ar[sorted_idx, :]

# Count the adjacent row-pairs that differ in at least one column;
# each such boundary starts a new unique subarray (+1 for the first row)
unq_out = np.any(np.diff(sorted_Ar, axis=0), 1).sum() + 1

Sample run -

In [159]: A # A is (2,2,3)
Out[159]: 
array([[[0, 0, 0],
        [0, 0, 2]],

       [[0, 0, 2],
        [2, 0, 1]]])

In [160]: unq_out
Out[160]: 3
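
As a quick cross-check of that sample run (using the same toy A and, purely for illustration, collections.Counter):

import numpy as np
from collections import Counter

A = np.array([[[0, 0, 0],
               [0, 0, 2]],
              [[0, 0, 2],
               [2, 0, 1]]])

# Three distinct rows: (0, 0, 0), (0, 0, 2), (2, 0, 1)
print(len(Counter(tuple(r) for r in A.reshape(-1, 3))))  # -> 3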

Get the count of occurrences of unique subarrays

# Reshape the 3D array to a 2D array, merging the first two dimensions
Ar = A.reshape(-1, A.shape[2])

# Perform a lexicographic sort so that identical rows end up adjacent
sorted_idx = np.lexsort(Ar.T)
sorted_Ar = Ar[sorted_idx, :]

# Assign a group ID to each row: the ID increments at every boundary
# between distinct rows
ids = np.append([0], np.any(np.diff(sorted_Ar, axis=0), 1).cumsum())

# Get counts for each ID as the final output
unq_count = np.bincount(ids)

Sample run -

In [64]: A
Out[64]: 
array([[[0, 0, 2],
        [1, 1, 1]],

       [[1, 1, 1],
        [1, 2, 0]]])

In [65]: unq_count
Out[65]: array([1, 2, 1], dtype=int64)
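
As an aside, if your NumPy is new enough (np.unique grew an axis argument in 1.13), both results above come out of a single call; a minimal sketch on the same sample A:

import numpy as np

A = np.array([[[0, 0, 2],
               [1, 1, 1]],
              [[1, 1, 1],
               [1, 2, 0]]])

# Unique rows and how often each one occurs, in one call
unq_rows, unq_count = np.unique(A.reshape(-1, A.shape[2]), axis=0,
                                return_counts=True)

print(len(unq_rows))  # -> 3, the number of unique subarrays
print(unq_count)      # -> [1 2 1], occurrences of each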

4 Comments

This is great (I hadn't thought to use np.lexsort and I didn't know about np.diff), but we're actually interested in finding the number of occurrences of unique subarrays, not just the number of unique subarrays. Could this method be adapted to return the unique subarrays along with their counts, as in @farhawa's answer?
Awesome, thank you. BTW, my modification of your original answer seems to be slightly faster than your extension: ~668 µs vs ~685 µs.
@Will Great! How about testing it on a larger dataset, something like (1000, 1000, 8), if possible?
I got similar results with a larger dataset: 423 ms vs 433 ms.

Here I've modified @Divakar's very useful answer to return the counts of the unique subarrays, as well as the subarrays themselves, so that the output is the same as that of collections.Counter.most_common() (wrapped as a function here, since the snippet ends with a return):

import numpy as np

def unique_subarray_counts(arr):
    # Get the array in 2D form, merging all leading dimensions
    arr = arr.reshape(-1, arr.shape[-1])

    # Lexicographically sort so that identical rows become adjacent
    sorted_arr = arr[np.lexsort(arr.T), :]

    # Get the indices where a new row appears
    diff_idx = np.where(np.any(np.diff(sorted_arr, axis=0), 1))[0]

    # Get the unique rows
    unique_rows = [sorted_arr[i] for i in diff_idx] + [sorted_arr[-1]]

    # Get the number of occurrences of each unique row (the -1 is needed
    # at the beginning, rather than 0, because of fencepost concerns)
    counts = np.diff(
        np.append(np.insert(diff_idx, 0, -1), sorted_arr.shape[0] - 1))

    # Return the (row, count) pairs sorted by count
    return sorted(zip(unique_rows, counts), key=lambda x: x[1], reverse=True)
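
As a quick sanity check (the array contents and the function name above are illustrative), the counts agree with collections.Counter:

from collections import Counter

import numpy as np

np.random.seed(0)
A = np.random.randint(0, 2, (128, 36, 8))

pairs = unique_subarray_counts(A)
expected = Counter(tuple(row) for row in A.reshape(-1, 8)).most_common()

# Ties may be ordered differently, so compare the count sequences only
assert [c for _, c in pairs] == [c for _, c in expected]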



I am not sure that it's the most efficient way to do it, but this should work.

arr = arr.reshape(128 * 36, 8)  # flatten the first two dimensions
unique_ = []
occurrence_ = []

# Linear scan: record each new subarray, bump the count of known ones
for sub in arr:
    if sub.tolist() not in unique_:
        unique_.append(sub.tolist())
        occurrence_.append(1)
    else:
        occurrence_[unique_.index(sub.tolist())] += 1

for index_, u in enumerate(unique_):
    print(u, "occurrence: %s" % occurrence_[index_])

4 Comments

This will work, but I was looking to avoid native-Python operations such as tolist and index, which are expensive. Thanks for the answer though.
An obvious optimization to your method, by the way, would be to keep the counts in a dictionary where the keys are tuples of the subarrays, rather than in a list that we need to keep searching through with unique_.index.
@Will or even better, use collections.Counter: counts = Counter(tuple(row) for row in arr) :) (see the sketch after these comments)
@BiRico, that's great, didn't know about that builtin!
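
Following up on the last two comments, a minimal sketch of that one-pass counting (the array here is arbitrary example data):

from collections import Counter

import numpy as np

A = np.random.randint(0, 3, (128, 36, 8))

# Hash each subarray as a tuple; Counter does the bookkeeping in one pass
counts = Counter(tuple(row) for row in A.reshape(-1, A.shape[-1]))

# The most common subarray and its number of occurrences
print(counts.most_common(1))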
