I have an array like so:

myarray = [['a', 'b', 'c'],
           ['b', 'c', 'd'],
           ['c', 'd', 'e']]

For this, np.unique(myarray, return_counts=True) works well and gives me the desired output. However, I would then like to apply it row by row, and have it tell me that in row number 1 (the first row) the counts for d and e are 0.

For the moment I've been trying to append them to the row array on each iteration of a for loop and then subtract 1 from each count, but even that has me confused. I've tried these two approaches:

for i in range(mylen):
    unique, counts = np.unique(np.array([list(myarray[i]), 'a', 'b', 'c', 'd', 'e']), return_counts=True) # attempt 1
    unique, counts = np.unique(np.vstack((myarray[i], 'a', 'b', 'c', 'd', 'e')), return_counts=True) # attempt 2

But neither works. Does anyone have an elegant solution? This will be used for thousands, perhaps millions, of values, so computation time is somewhat relevant to the discussion.

3 Answers

2

You can use np.unique with return_inverse=True to get what you want:

myarray = np.asarray(myarray)  # the question's myarray is a list; .shape needs an ndarray
letters, inv = np.unique(myarray, return_inverse=True)
inv = inv.reshape(myarray.shape)

inv is now

array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4]], dtype=int64)

You can get counts of all the unique elements in one line:

>>> (inv == np.arange(len(letters)).reshape(-1, 1, 1)).sum(-1)
array([[1, 0, 0],
       [1, 1, 0],
       [1, 1, 1],
       [0, 1, 1],
       [0, 0, 1]])

The first dimension corresponds to the letter in letters, the second to the row number, since sum(-1) sums across the columns. You can get counts for the columns using sum(1) instead. In your symmetrical example, the result will be identical.
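As a minimal sketch of the axis choices described above, transposing the result gives one row of counts per input row, which is the layout the question asks for. The names `matches`, `per_row`, and `per_col` are illustrative and not part of the original answer.

```python
import numpy as np

myarray = np.array([['a', 'b', 'c'],
                    ['b', 'c', 'd'],
                    ['c', 'd', 'e']])

letters, inv = np.unique(myarray, return_inverse=True)
inv = inv.reshape(myarray.shape)

# One boolean plane per letter: shape (n_letters, n_rows, n_cols)
matches = inv == np.arange(len(letters)).reshape(-1, 1, 1)

# Sum over the columns, then transpose: shape (n_rows, n_letters)
per_row = matches.sum(-1).T
# Sum over the rows, then transpose: shape (n_cols, n_letters)
per_col = matches.sum(1).T

print(letters)
print(per_row)
```

Here `per_row[i, j]` is the number of times `letters[j]` occurs in row `i`. Because the example array is symmetric, `per_col` holds the same values.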

No looping, no np.apply_along_axis (which is a glorified loop), all vectorized. Here is a quick timing test:

import string
import numpy as np

np.random.seed(42)
myarray = np.random.choice(list(string.ascii_lowercase), size=(100, 100))

def Epsi95(arr):
    uniques = np.unique(arr)
    def fun(x):
        base_dict = dict(zip(uniques, [0]*uniques.shape[0]))
        base_dict.update(dict(zip(*np.unique(x, return_counts=True))))
        return [i[-1] for i in sorted(base_dict.items())]
    return np.apply_along_axis(fun, 1, arr)

def MadPhysicist(myarray):
    letters, inv = np.unique(myarray, return_inverse=True)
    inv = inv.reshape(myarray.shape)
    return (inv == np.arange(len(letters)).reshape(-1, 1, 1)).sum(-1)

%timeit Epsi95(myarray)
6.37 ms ± 26.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit MadPhysicist(myarray)
1.28 ms ± 6.85 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
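The broadcast comparison builds an intermediate boolean array of shape (n_letters, n_rows, n_cols), which may become large for millions of values. One possible alternative, sketched here under the assumption of a 2-D input, offsets each row's inverse indices and counts them with np.bincount. The function name `row_counts_bincount` is hypothetical, and this variant was not part of the timing above.

```python
import numpy as np

def row_counts_bincount(arr):
    """Per-row counts of every unique value, shape (n_rows, n_unique)."""
    arr = np.asarray(arr)  # assumed to be 2-D
    values, inv = np.unique(arr, return_inverse=True)
    inv = inv.reshape(arr.shape)
    n_rows, n_vals = arr.shape[0], len(values)
    # Offset each row's codes so every (row, value) pair gets its own bin
    flat = (inv + np.arange(n_rows)[:, None] * n_vals).ravel()
    counts = np.bincount(flat, minlength=n_rows * n_vals)
    return values, counts.reshape(n_rows, n_vals)

values, counts = row_counts_bincount([['a', 'b', 'c'],
                                      ['b', 'c', 'd'],
                                      ['c', 'd', 'e']])
print(counts)
```

Note that the result is oriented rows by values, the transpose of the array shown earlier. Whether it is actually faster will depend on the array shape and the number of unique values, so it is worth timing on real data.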

3 Comments

Yes, this solution is far superior in terms of efficiency.
This looks fantastic, thanks. Just so that I'm clear on a couple of points: 1) the array you show after (inv == np.arange(len(letters)).reshape(-1, 1, 1)).sum(-1) gives, for each possibility in my universe, the number of times it appears in each row, right? So [1,0,0] means it appears once in the first row, then not at all in rows 2 and 3. 2) If I wanted to apply this to columns instead of rows, how would that work? Would it be easiest to just transpose myarray and apply the same function?
You can play with the dimensions, yes. And yes to the interpretation of the data.
1
import numpy as np

myarray = [['a', 'b', 'c'],
           ['b', 'c', 'd'],
           ['c', 'd', 'e']]

arr = np.array(myarray)

uniques = np.unique(arr)

def fun(x):
    base_dict = dict(zip(uniques, [0]*uniques.shape[0]))
    base_dict.update(dict(zip(*np.unique(x, return_counts=True))))
    return [i[-1] for i in sorted(base_dict.items())]

np.apply_along_axis(fun, 1, arr)

# array([[1, 1, 1, 0, 0], # a=1 b=1 c=1 d=0 e=0
#        [0, 1, 1, 1, 0],
#        [0, 0, 1, 1, 1]], dtype=int64)

3 Comments

apply_along_axis is just a glorified for loop, no matter what the docs may tell you.
You never need to sort the output of unique.
You are correct in both cases. In the second case I was actually doing return list(base_dict.values()), since dicts keep insertion order in Python 3.6+, but I later decided to generalize and forgot to remove the first sort.
0

You can iterate over the rows of the list and, for each row, over the unique values of the entire array. The example below prints the counts, but the same loop can insert them into a dictionary or any other structure of your choosing.

Example:

import numpy as np

myarray = [['a', 'b', 'c'],
           ['b', 'c', 'd'],
           ['c', 'd', 'e']]

uniq = np.unique(np.array(myarray))

for idx, row in enumerate(myarray):
    for x in uniq:
        print(f"Row {idx} Element ({x}) Count: {row.count(x)}")

Output:

Row 0 Element (a) Count: 1
Row 0 Element (b) Count: 1
Row 0 Element (c) Count: 1
Row 0 Element (d) Count: 0
Row 0 Element (e) Count: 0
Row 1 Element (a) Count: 0
Row 1 Element (b) Count: 1
Row 1 Element (c) Count: 1
Row 1 Element (d) Count: 1
Row 1 Element (e) Count: 0
Row 2 Element (a) Count: 0
Row 2 Element (b) Count: 0
Row 2 Element (c) Count: 1
Row 2 Element (d) Count: 1
Row 2 Element (e) Count: 1

To use a list of dictionaries for each row:

import numpy as np

myarray = [['a', 'b', 'c'],
           ['b', 'c', 'd'],
           ['c', 'd', 'e']]

uniq = np.unique(np.array(myarray))
row_vals = []

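for row in myarray:
    counts = {}  # avoid naming this "dict", which would shadow the built-in
    for x in uniq:
        # str(x) keeps the keys as plain Python strings rather than NumPy scalars
        counts[str(x)] = row.count(x)
    row_vals.append(counts)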

for r in row_vals:
    print(r)

Output:

{'a': 1, 'b': 1, 'c': 1, 'd': 0, 'e': 0}
{'a': 0, 'b': 1, 'c': 1, 'd': 1, 'e': 0}
{'a': 0, 'b': 0, 'c': 1, 'd': 1, 'e': 1}
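Calling row.count(x) rescans the row once per unique value. As a rough sketch of one way to avoid that, collections.Counter from the standard library tallies each row in a single pass. The name `tally` is illustrative, and the set of unique values is built here in pure Python instead of with np.unique.

```python
from collections import Counter

myarray = [['a', 'b', 'c'],
           ['b', 'c', 'd'],
           ['c', 'd', 'e']]

# Sorted universe of values across all rows
uniq = sorted(set(x for row in myarray for x in row))

row_vals = []
for row in myarray:
    tally = Counter(row)  # one pass over the row
    # A Counter returns 0 for keys it has not seen
    row_vals.append({x: tally[x] for x in uniq})

for r in row_vals:
    print(r)
```

This prints the same three dictionaries as above.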
