Counting combinations over pairs of columns in a numpy array

Question

I have a matrix whose columns contain only the numbers 0 and 1. I want to count the number of [0, 0], [0, 1], [1, 0], and [1, 1] rows in each PAIR of columns.

So for example, if I have a matrix with four columns, I want to count the number of 00s, 01s, 10s, and 11s in the first and second columns, append the result to a list, then do the same for the 3rd and 4th columns and append that result to the list.

Example input:

array([[0, 1, 1, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 1],
       [1, 1, 0, 0]])

My expected output is:

array([[1, 1],
       [2, 1],
       [1, 2],
       [1, 1]])

Explanation:

The first two columns have [0, 0] once. The second two columns also have [0, 0] once. The first two columns have [0, 1] twice, and the second two columns have [0, 1] once... and so on.

This is my latest attempt and it seems to work. Would like feedback.

# for each pair of columns calculate haplotype frequencies
# haplotypes:
# h1 = 11
# h2 = 10
# h3 = 01
# h4 = 00
# takes as input a matrix with an even number of columns
def calc_haplotype_freq(matrix):
    h1_frequencies = []
    h2_frequencies = []
    h3_frequencies = []
    h4_frequencies = []
    colIndex1 = 0
    colIndex2 = 1
    for i in range(matrix.shape[1] // 2):  # one iteration per pair of columns
        h1 = 0
        h2 = 0
        h3 = 0
        h4 = 0
        column_1 = matrix[:, colIndex1]
        column_2 = matrix[:, colIndex2]
        for row in range(0, matrix.shape[0]):
            # index with [row] so this also works when column_1 is a 1-D ndarray
            if column_1[row] == 1 and column_2[row] == 1:
                h1 += 1
            elif column_1[row] == 1 and column_2[row] == 0:
                h2 += 1
            elif column_1[row] == 0 and column_2[row] == 1:
                h3 += 1
            elif column_1[row] == 0 and column_2[row] == 0:
                h4 += 1
        colIndex1 += 2
        colIndex2 += 2
        h1_frequencies.append(h1)
        h2_frequencies.append(h2)
        h3_frequencies.append(h3)
        h4_frequencies.append(h4)
    print("H1 Frequencies (11): ", h1_frequencies)
    print("H2 Frequencies (10): ", h2_frequencies)
    print("H3 Frequencies (01): ", h3_frequencies)
    print("H4 Frequencies (00): ", h4_frequencies)

For the sample input above, this gives:

----------
H1 Frequencies (11):  [1, 1]
H2 Frequencies (10):  [1, 2]
H3 Frequencies (01):  [2, 1]
H4 Frequencies (00):  [1, 1]
----------

Which is correct, but is there a better way to do this? How can I return these results from the function for further processing?

cs95 · Accepted Answer · 2018-01-28 07:51:46Z

3

Starting with this -

x
array([[0, 1, 1, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 1],
       [1, 1, 0, 0]])

Split your array into groups of 2 columns and concatenate them:

import numpy as np

y = x.T
z = np.concatenate([y[i:i + 2] for i in range(0, y.shape[0], 2)], 1).T
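To see what this step produces, here is a minimal sketch that rebuilds the sample array and inspects `z`. It only restates the two lines above with the shapes written out; nothing here is new to the approach.

```python
import numpy as np

x = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1],
              [1, 1, 0, 0]])

y = x.T  # shape (4, 5): one row per original column
# Each y[i:i + 2] is a (2, 5) block holding one column pair. Joining the
# blocks along axis 1 gives (2, 10), and transposing gives (10, 2).
z = np.concatenate([y[i:i + 2] for i in range(0, y.shape[0], 2)], 1).T

print(z.shape)  # (10, 2)
print(z[:5])    # the rows of columns 0 and 1 of x
print(z[5:])    # the rows of columns 2 and 3 of x
```

So `z` stacks every column pair on top of the previous one, which is why the next step yields totals over all pairs rather than counts per pair.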

Now, perform a broadcasted comparison and sum:

(z[:, None] == [[0, 0], [0, 1], [1, 0], [1, 1]]).all(2).sum(0)
array([2, 3, 3, 2])
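The one-liner above can be unpacked into its intermediate arrays. This is a sketch for illustration only; the names `patterns`, `eq`, `match`, and `totals` are mine, and `z` is written out by hand to match the sample input.

```python
import numpy as np

# z for the sample input: the (col 0, col 1) rows of x
# followed by the (col 2, col 3) rows.
z = np.array([[0, 1], [1, 0], [0, 1], [0, 0], [1, 1],
              [1, 0], [1, 0], [0, 1], [1, 1], [0, 0]])
patterns = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

eq = z[:, None] == patterns  # (10, 1, 2) vs (4, 2) broadcasts to (10, 4, 2)
match = eq.all(2)            # (10, 4): True where both elements of a row match
totals = match.sum(0)        # (4,): number of rows matching each pattern

print(eq.shape, match.shape)  # (10, 4, 2) (10, 4)
print(totals)                 # [2 3 3 2]
```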

If you want a separate count for each pair of columns, then you could do something like this:

def calc_haplotype_freq(x):
    counts = []
    for i in range(0, x.shape[1], 2):
        counts.append(
             (x[:, None, i:i + 2] == [[0, 0], [0, 1], [1, 0], [1, 1]]).all(2).sum(0)
        )

    return np.column_stack(counts)

calc_haplotype_freq(x)
array([[1, 1],
       [2, 1],
       [1, 2],
       [1, 1]])
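If the Python loop over column pairs becomes a bottleneck, one possible loop-free variant is to encode each pair as a number from 0 to 3 and count with `np.bincount`. This is a different technique from the function above, not a tuned or benchmarked solution. The function name `pair_counts_vectorized` is hypothetical, and the sketch assumes the input has an even number of columns containing only 0 and 1.

```python
import numpy as np

def pair_counts_vectorized(x):  # hypothetical name, for illustration
    x = np.asarray(x)
    n_pairs = x.shape[1] // 2   # assumes an even number of columns
    # Encode each pair (a, b) as 2*a + b:
    # [0, 0] -> 0, [0, 1] -> 1, [1, 0] -> 2, [1, 1] -> 3.
    codes = 2 * x[:, 0::2] + x[:, 1::2]            # shape (n_rows, n_pairs)
    # Shift each pair's codes into its own block of 4 bins, then count once.
    flat = (codes + 4 * np.arange(n_pairs)).ravel()
    counts = np.bincount(flat, minlength=4 * n_pairs)
    return counts.reshape(n_pairs, 4).T            # rows: patterns, columns: pairs

x = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1],
              [1, 1, 0, 0]])
result = pair_counts_vectorized(x)
print(result)
```

For the sample input this gives the same 4 x 2 array as `calc_haplotype_freq(x)`, with rows ordered [0, 0], [0, 1], [1, 0], [1, 1].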

edited Jan 28, 2018 at 7:51

answered Jan 28, 2018 at 7:35

cs95


5 Comments

dddxxx Over a year ago

Wow, this is great. However, I don't need total sums across all columns, I need to be able to see the unique number of combinations for all pairs of columns for downstream processing. I was actually able to solve my problem, however, I wonder if there's a way to do it using your way? PS: Thank you for responding!

cs95 Over a year ago

@Carlos [2, 3, 3, 2] is the number of combinations for [0, 0] ; [0, 1]; [1, 0]; and [1, 1] respectively. Is this not what you wanted?

dddxxx Over a year ago

@COLDSPEED yes and no, the number of combinations could be PER column pair, so the correct answer would be [1, 1, 2, 1] for columns 1 and 2, and [1, 2, 1, 1] for columns 3 and 4. Please see my edit. Your totals are correct, but they must be returned individually for further downstream analysis.

dddxxx Over a year ago

@COLDSPEED yep, you solved it, thanks a ton! Can't believe I spent four hours trying to get this to work lol.

cs95 Over a year ago

@Carlos I believe it's a step in the right direction, but as far as performance goes, I'm not sure how close/far it is from the best one out there. Still, if you're satisfied with this, then that's cool. Good luck with the rest of your work!
