I have a matrix whose columns contain only the numbers 0 and 1. I want to count the number of [0, 0], [0, 1], [1, 0], and [1, 1] rows in each PAIR of columns.

So for example, if I have a matrix with four columns, I want to count the number of 00s, 01s, 10s, and 11s in the first and second columns, append the result to a list, then do the same for the third and fourth columns and append that result to the list.

Example input:

array([[0, 1, 1, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 1],
       [1, 1, 0, 0]])

My expected output is:

array([[1, 1],
       [2, 1],
       [1, 2],
       [1, 1]])

Explanation:

The first two columns have [0, 0] once. The second two columns also have [0, 0] once. The first two columns have [0, 1] twice, and the second two columns have [0, 1] once... and so on.


This is my latest attempt and it seems to work. Would like feedback.

# for each pair of columns calculate haplotype frequencies
# haplotypes:
# h1 = 11
# h2 = 10
# h3 = 01
# h4 = 00
# takes as input a matrix with an even number of columns
def calc_haplotype_freq(matrix):
    h1_frequencies = []
    h2_frequencies = []
    h3_frequencies = []
    h4_frequencies = []
    colIndex1 = 0
    colIndex2 = 1
    for i in range(matrix.shape[1] // 2): # number of columns divided by 2
        h1 = 0
        h2 = 0
        h3 = 0
        h4 = 0
        column_1 = matrix[:, colIndex1]
        column_2 = matrix[:, colIndex2]
        for row in range(matrix.shape[0]):
            # index with [row] so this also works for a plain 2-D array
            if column_1[row] == 1 and column_2[row] == 1:
                h1 += 1
            elif column_1[row] == 1 and column_2[row] == 0:
                h2 += 1
            elif column_1[row] == 0 and column_2[row] == 1:
                h3 += 1
            elif column_1[row] == 0 and column_2[row] == 0:
                h4 += 1
        colIndex1 += 2
        colIndex2 += 2
        h1_frequencies.append(h1)
        h2_frequencies.append(h2)
        h3_frequencies.append(h3)
        h4_frequencies.append(h4)
    print("H1 Frequencies (11): ", h1_frequencies)
    print("H2 Frequencies (10): ", h2_frequencies)
    print("H3 Frequencies (01): ", h3_frequencies)
    print("H4 Frequencies (00): ", h4_frequencies)

For the sample input above, this gives:

----------
H1 Frequencies (11):  [1, 1]
H2 Frequencies (10):  [1, 2]
H3 Frequencies (01):  [2, 1]
H4 Frequencies (00):  [1, 1]
----------

Which is correct, but is there a better way to do this? How can I return these results from the function for further processing?

1 Answer

Starting with this array:

x
array([[0, 1, 1, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 1],
       [1, 1, 0, 0]])

Split your array into groups of two columns and stack the groups vertically, so that z has two columns and one row for every row of every column pair:

import numpy as np

y = x.T
z = np.concatenate([y[i:i + 2] for i in range(0, y.shape[0], 2)], 1).T

Now, perform a broadcasted comparison and sum:

(z[:, None] == [[0, 0], [0, 1], [1, 0], [1, 1]]).all(2).sum(0)
array([2, 3, 3, 2])
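
To see why the comparison works, here is a minimal sketch that breaks it into steps and notes the shape at each one. It assumes x is the sample array from the question; the names patterns, matches, row_hits and totals are introduced here only for illustration.

```python
import numpy as np

x = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1],
              [1, 1, 0, 0]])

# Stack the column pairs vertically, as in the step above: shape (10, 2)
y = x.T
z = np.concatenate([y[i:i + 2] for i in range(0, y.shape[0], 2)], 1).T

# The four patterns to look for (illustrative name): shape (4, 2)
patterns = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# z[:, None] has shape (10, 1, 2); broadcasting against (4, 2)
# compares every row of z with every pattern: shape (10, 4, 2)
matches = z[:, None] == patterns

# A row matches a pattern only if both of its elements agree: shape (10, 4)
row_hits = matches.all(2)

# Summing over the rows counts the occurrences of each pattern: shape (4,)
totals = row_hits.sum(0)
print(totals)
```

The inserted axis from z[:, None] is what lets every row be compared with all four patterns in one operation.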

If you want a separate count for each pair of columns, you could do something like this:

def calc_haplotype_freq(x):
    counts = []
    for i in range(0, x.shape[1], 2):
        counts.append(
            (x[:, None, i:i + 2] == [[0, 0], [0, 1], [1, 0], [1, 1]]).all(2).sum(0)
        )

    return np.column_stack(counts)

calc_haplotype_freq(x)
array([[1, 1],
       [2, 1],
       [1, 2],
       [1, 1]])
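
If the Python loop over column pairs becomes a concern, one possible loop-free variant is sketched below. This is a different technique from the one above: it encodes each pair (a, b) as the integer 2*a + b and counts the codes. The function name calc_haplotype_freq_vectorized is hypothetical, and the sketch assumes the input holds only 0s and 1s and has an even number of columns. Whether it is actually faster has not been measured here.

```python
import numpy as np

def calc_haplotype_freq_vectorized(x):
    """Count [0, 0], [0, 1], [1, 0], [1, 1] rows per consecutive column pair.

    Hypothetical helper for illustration. Assumes x is a 2-D array of
    0s and 1s with an even number of columns.
    """
    x = np.asarray(x)
    # Encode each pair (a, b) as 2*a + b, so [0,0]->0, [0,1]->1, [1,0]->2, [1,1]->3
    codes = 2 * x[:, 0::2] + x[:, 1::2]            # shape (n_rows, n_pairs)
    # Compare all codes against 0..3 at once: shape (4, n_rows, n_pairs)
    hits = codes[None, :, :] == np.arange(4)[:, None, None]
    # Sum over the rows: shape (4, n_pairs), one column per column pair
    return hits.sum(axis=1)

x = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1],
              [1, 1, 0, 0]])

result = calc_haplotype_freq_vectorized(x)
print(result)
```

The result has the same layout as the output of calc_haplotype_freq above: one row per pattern and one column per column pair.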

5 Comments

Wow, this is great. However, I don't need total sums across all columns, I need to be able to see the unique number of combinations for all pairs of columns for downstream processing. I was actually able to solve my problem, however, I wonder if there's a way to do it using your way? PS: Thank you for responding!
@Carlos [2, 3, 3, 2] is the number of combinations for [0, 0] ; [0, 1]; [1, 0]; and [1, 1] respectively. Is this not what you wanted?
@COLDSPEED yes and no, the number of combinations could be PER column pair, so the correct answer would be [1, 1, 2, 1] for columns 1 and 2, and [1, 2, 1, 1] for columns 3 and 4. Please see my edit. Your totals are correct, but they must be returned individually for further downstream analysis.
@COLDSPEED yep, you solved it, thanks a ton! Can't believe I spent four hours trying to get this to work lol.
@Carlos I believe it's a step in the right direction, but as far as performance goes, I'm not sure how close/far it is from the best one out there. Still, if you're satisfied with this, then that's cool. Good luck with the rest of your work!
