Count number of identical values in a column within a numpy array

Question

I'm looking for a solution to the following problem:

Let's say I have an array with shape (4, 4):

[5. 4. 5. 4.]
[2. 3. 5. 5.]
[2. 1. 5. 1.]
[1. 3. 1. 3.]

Within this array there is one column in which the value "5" appears 3 times in a row. That is, the values are consecutive, not scattered across the column as in the counter-example below.

[5.] # This
[1.] # Should
[5.] # Not
[5.] # Count

Now let's say I have a bigger array with shape (M,N) and various integer values in the same range of 1-5. How would I go about counting the maximum number of identical values appearing in a row per column? Furthermore, is it possible to obtain the indices these values would appear at? The expected output of the above example would be

Found 3 in a row of number 5 in column 2
(0,2), (1,2), (2,2)

I assume that the implementation would be similar if the search should concern rows. If not I'd love to know how this is done as well.

In general you try to find streaks in a row. You can make use of docs.python.org/2/library/itertools.html#itertools.groupby. A working example can be found here stackoverflow.com/questions/28839607/… (2nd answer). Within that loop you can keep track of the index where the streak starts. If you do this for all columns you will find your result.
– user, Commented Aug 26, 2018 at 20:30

Divakar · Accepted Answer · 2018-08-26 21:26:57Z

Approach #1

Here's one approach -

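import numpy as np

def find_longest_island_indices(a, values):
    # Pad a row of zeros above and below so every island has a start and an end edge
    # (this assumes 0 is not one of the searched values)
    b = np.pad(a, ((1,1),(0,0)), 'constant')
    shp = np.array(b.shape)[::-1] - [0,1]   # shape of the transposed edge mask
    maxlens = []
    final_out = []
    for v in values:
        m = b==v
        # Flat positions, column by column, where the mask flips between False and True
        idx = np.flatnonzero((m[:-1] != m[1:]).T)
        s0,s1 = idx[::2], idx[1::2]         # island starts and exclusive stops
        l = s1-s0                           # island lengths
        maxidx = l.argmax()                 # first longest island; raises if v never occurs
        longest_island_flatidx = np.r_[s0[maxidx]:s1[maxidx]]
        # r is the column of a, c is the row of a
        r,c = np.unravel_index(longest_island_flatidx, shp)
        final_out.append(np.c_[c,r])        # store as (row, col) pairs
        maxlens.append(l[maxidx])
    return maxlens, final_out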

Sample run -

In [169]: a
Out[169]: 
array([[5, 4, 5, 4],
       [2, 3, 5, 5],
       [2, 1, 5, 1],
       [1, 3, 1, 3]])

In [173]: maxlens
Out[173]: [1, 2, 1, 1, 3]

In [174]: out
Out[174]: 
[array([[3, 0]]), array([[1, 0],
        [2, 0]]), array([[1, 1]]), array([[0, 1]]), array([[0, 2],
        [1, 2],
        [2, 2]])]

# With "pretty" printing
In [171]: maxlens, out = find_longest_island_indices(a, [1,2,3,4,5])
     ...: for  l,o,i in zip(maxlens,out,[1,2,3,4,5]):
     ...:     print("For "+str(i)+" : L= "+str(l)+", Idx = "+str(o.tolist()))
For 1 : L= 1, Idx = [[3, 0]]
For 2 : L= 2, Idx = [[1, 0], [2, 0]]
For 3 : L= 1, Idx = [[1, 1]]
For 4 : L= 1, Idx = [[0, 1]]
For 5 : L= 3, Idx = [[0, 2], [1, 2], [2, 2]]
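
To see why the padding and the `m[:-1] != m[1:]` comparison locate the islands, here is a minimal single-column sketch of the same idea. The helper name `islands_1d` is made up for illustration and is not part of the answer's code. It pads the boolean mask with False on both ends, so the positions where the mask flips alternate between an island start and an exclusive stop.

```python
import numpy as np

def islands_1d(col, v):
    """Return (start, stop) pairs, stop exclusive, for each run of v in col."""
    # Pad with False on both ends so every run has a rising and a falling edge.
    m = np.concatenate(([False], np.asarray(col) == v, [False]))
    # Positions where the mask flips; they alternate start, stop, start, stop, ...
    idx = np.flatnonzero(m[:-1] != m[1:])
    return list(zip(idx[::2].tolist(), idx[1::2].tolist()))

col = [5, 5, 5, 1, 5]
print(islands_1d(col, 5))   # [(0, 3), (4, 5)]
```

Two caveats follow from this picture for the 2-D version above: it pads with the constant 0, so it assumes 0 is not among the searched values, and `l.argmax()` raises a ValueError for a value that never occurs in the array, because there are no islands to choose from.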

Approach #2

With a bit of modification and outputting the start and end indices for the max-length island, here's one -

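import numpy as np

def find_longest_island_indices_v2(a, values):
    # Work on the transpose so each column of a becomes a contiguous, zero-padded row
    # (this assumes 0 is not one of the searched values)
    b = np.pad(a.T, ((0,0),(1,1)), 'constant')
    shp = b.shape
    out = []
    for v in values:
        m = b==v
        # Flat positions where the mask flips between False and True
        idx = np.flatnonzero(m.flat[:-1] != m.flat[1:])
        s0,s1 = idx[::2], idx[1::2]         # island starts and exclusive stops
        l = s1-s0                           # island lengths
        maxidx = l.argmax()                 # first longest island; raises if v never occurs
        # Reverse the unravelled index to report (row, col) of the original array
        start_index = np.unravel_index(s0[maxidx], shp)[::-1]
        end_index = np.unravel_index(s1[maxidx]-1, shp)[::-1]
        maxlen = l[maxidx]
        out.append([v,maxlen, start_index, end_index])
    return out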

Sample run -

In [251]: a
Out[251]: 
array([[5, 4, 5, 4],
       [2, 3, 5, 5],
       [2, 1, 5, 1],
       [1, 3, 1, 3]])

In [252]: out = find_longest_island_indices_v2(a, [1,2,3,4,5])

In [255]: out
Out[255]: 
[[1, 1, (3, 0), (3, 0)],
 [2, 2, (1, 0), (2, 0)],
 [3, 1, (1, 1), (1, 1)],
 [4, 1, (0, 1), (0, 1)],
 [5, 3, (0, 2), (2, 2)]]

# With some pandas styled printing 
In [253]: import pandas as pd

In [254]: pd.DataFrame(out, columns=['Val','MaxLen','StartIdx','EndIdx'])
Out[254]: 
   Val  MaxLen StartIdx  EndIdx
0    1       1   (3, 0)  (3, 0)
1    2       2   (1, 0)  (2, 0)
2    3       1   (1, 1)  (1, 1)
3    4       1   (0, 1)  (0, 1)
4    5       3   (0, 2)  (2, 2)
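
Since the question also asks for every index in the streak, the start and end indices can be expanded into the full list. This is only a small sketch; `expand_island` is a hypothetical helper, and it assumes both tuples are (row, col) pairs in the same column, as in the output above.

```python
def expand_island(start_index, end_index):
    """List every (row, col) from start_index to end_index, both inclusive."""
    (r0, c0), (r1, c1) = start_index, end_index
    if c0 != c1:
        # Column-wise islands never span more than one column.
        raise ValueError("start and end must be in the same column")
    return [(r, c0) for r in range(r0, r1 + 1)]

print(expand_island((0, 2), (2, 2)))   # [(0, 2), (1, 2), (2, 2)]
```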

Joe Iddon · Accepted Answer · 2018-08-26 20:41:45Z

If we store the maximum length of a run of identical values in a column in a variable, then we can iterate through looking for runs of greater length.

If the following requires more explanation, just say!

import numpy as np

a = np.array([[5,4,5,4],[2,3,5,5],[2,1,5,1],[1,3,1,3]])
rows, cols = a.shape
max_length = 0
for ci in range(cols):
    for ri in range(rows):
        if ri > 0 and a[ri,ci] == a[ri-1,ci]:  #during run
            length += 1
        else:                                  #start of a new run
            start_pos = (ri, ci)
            length = 1
        if length > max_length:                #checked on every step, so runs ending at the last row count too
            max_length = length
            max_pos = start_pos

max_row, max_col = max_pos
print('Found {} in a row of number {} in column {}'.format(max_length, a[max_pos], max_col))
for i in range(max_length):
    print((max_row+i, max_col))

Output:

Found 3 in a row of number 5 in column 2
(0, 2)
(1, 2)
(2, 2)

Note that if you would like the output of the tuples to be in the exact format you stated, then you can use a generator-expression with str.join:

print(', '.join('({},{})'.format(max_row+i, max_col) for i in range(max_length)))
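
For the row-wise search the question also mentions, one option is to wrap the same scan in a function and run it on the transpose. The sketch below is an illustration rather than part of the original answer: the name `longest_run` and its `by` parameter are made up, it assumes a non-empty 2-D array, and ties go to the first run found. It compares the current length against the best on every step, so a run that reaches the last row or column is counted as well.

```python
import numpy as np

def longest_run(a, by="column"):
    """Return (length, value, positions) of the longest run of equal values."""
    a = np.asarray(a)
    # Scanning along rows is the same as scanning down the columns of a.T.
    view = a if by == "column" else a.T
    rows, cols = view.shape
    best_len, best_start = 0, None
    for ci in range(cols):
        length = 0
        for ri in range(rows):
            if ri > 0 and view[ri, ci] == view[ri - 1, ci]:
                length += 1          # run continues
            else:
                length = 1           # a new run starts here
            if length > best_len:
                best_len, best_start = length, (ri - length + 1, ci)
    r0, c0 = best_start
    positions = [(r0 + i, c0) for i in range(best_len)]
    if by == "row":
        # Map the transposed coordinates back to (row, col) of the original array.
        positions = [(c, r) for r, c in positions]
    return best_len, a[positions[0]], positions

a = np.array([[5, 4, 5, 4], [2, 3, 5, 5], [2, 1, 5, 1], [1, 3, 1, 3]])
print(longest_run(a))             # longest run down a column
print(longest_run(a, by="row"))   # longest run along a row
```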

Dani Mesejo · Accepted Answer · 2018-08-27 00:06:13Z

Another approach is to use itertools.groupby, as suggested by @user. A possible implementation is the following:

import numpy as np
from itertools import groupby


def runs(column):
    max_run_length, start, indices, max_value = -1, 0, 0, 0
    for val, run in groupby(column):
        run_length = sum(1 for _ in run)
        if run_length > max_run_length:
            max_run_length, start, max_value = run_length, indices, val
        indices += run_length

    return max_value, max_run_length, start

The function above computes, for a given column (or row), the value of the longest run, its length, and its start index. With these values you can compute your expected output. groupby does all the heavy lifting. Take the array [5., 5., 5., 1.]:

[(val, sum(1 for _ in run)) for val, run in groupby([5., 5., 5., 1.])]

The previous line outputs [(5.0, 3), (1.0, 1)]. The loop keeps the start index, the length, and the value of the longest run. To apply the function to the columns you can use numpy.apply_along_axis:

data = np.array([[5., 4., 5., 4.],
                 [2., 3., 5., 5.],
                 [2., 1., 5., 1.],
                 [1., 3., 1., 3.]])

result = [tuple(row) for row in np.apply_along_axis(runs, 0, data).T]
print(result)

Output

[(2.0, 2.0, 1.0), (4.0, 1.0, 0.0), (5.0, 3.0, 0.0), (4.0, 1.0, 0.0)]

In the output above, the third tuple corresponds to the third column: the value of the longest consecutive run is 5, its length is 3, and it starts at index 0. To search rows instead of columns, change the axis argument to 1 and drop the T, like this:

result = [tuple(row) for row in np.apply_along_axis(runs, 1, data)]

Output

[(5.0, 1.0, 0.0), (5.0, 2.0, 2.0), (2.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
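
To get from the column-wise tuples shown earlier to the output format requested in the question, something along these lines could work. It is a sketch under assumptions: `describe_longest` is a hypothetical helper, the input is one (value, run_length, start) tuple per column as printed above, and the values are whole numbers, so converting the floats back to int loses nothing.

```python
def describe_longest(result):
    """Pick the column with the longest run and list the indices of that run."""
    # Column whose longest run is the longest overall (first one wins ties).
    col = max(range(len(result)), key=lambda c: result[c][1])
    value, length, start = result[col]
    length, start = int(length), int(start)   # apply_along_axis returned floats
    indices = [(start + i, col) for i in range(length)]
    header = "Found {} in a row of number {} in column {}".format(
        length, int(value), col)
    return header, indices

result = [(2.0, 2.0, 1.0), (4.0, 1.0, 0.0), (5.0, 3.0, 0.0), (4.0, 1.0, 0.0)]
header, indices = describe_longest(result)
print(header)
print(", ".join("({},{})".format(r, c) for r, c in indices))
```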
