I have a matrix with a certain number of columns that contain only the numbers 0 and 1, I want to count the number of [0, 0], [0, 1], [1, 0], and [1, 1] in each PAIR of columns.
So for example, if I have a matrix with four columns, I want to count the number of 00s, 11s, 01s, and 11s in the first and second column, append the final result to a list, then loop over the 3rd and 4th column and append that answer to the list.
Example input:
array([[0, 1, 1, 0],
[1, 0, 1, 0],
[0, 1, 0, 1],
[0, 0, 1, 1],
[1, 1, 0, 0]])
My expected output is:
array([[1, 1],
[2, 1],
[1, 2],
[1, 1]])
Explanation:
The first two columns have [0, 0] once. The second two columns also have [0, 0] once. The first two columns have [0, 1] twice, and the second two columns have [0, 1] once... and so on.
This is my latest attempt and it seems to work. Would like feedback.
# for each pair of columns calculate haplotype frequencies
# haplotypes:
# h1 = 11
# h2 = 10
# h3 = 01
# h4 = 00
# takes as input a pair of columns
def calc_haplotype_freq(matrix):
h1_frequencies = []
h2_frequencies = []
h3_frequencies = []
h4_frequencies = []
colIndex1 = 0
colIndex2 = 1
for i in range(0, 2): # number of columns divided by 2
h1 = 0
h2 = 0
h3 = 0
h4 = 0
column_1 = matrix[:, colIndex1]
column_2 = matrix[:, colIndex2]
for row in range(0, matrix.shape[0]):
if (column_1[row, 0] == 1).any() & (column_2[row, 0] == 1).any():
h1 += 1
elif (column_1[row, 0] == 1).any() & (column_2[row, 0] == 0).any():
h2 += 1
elif (column_1[row, 0] == 0).any() & (column_2[row, 0] == 1).any():
h3 += 1
elif (column_1[row, 0] == 0).any() & (column_2[row, 0] == 0).any():
h4 += 1
colIndex1 += 2
colIndex2 += 2
h1_frequencies.append(h1)
h2_frequencies.append(h2)
h3_frequencies.append(h3)
h4_frequencies.append(h4)
print("H1 Frequencies (11): ", h1_frequencies)
print("H2 Frequencies (10): ", h2_frequencies)
print("H3 Frequencies (01): ", h3_frequencies)
print("H4 Frequencies (00): ", h4_frequencies)
For the sample input above, this gives:
----------
H1 Frequencies (11): [1, 1]
H2 Frequencies (10): [1, 2]
H3 Frequencies (01): [2, 1]
H4 Frequencies (00): [1, 1]
----------
Which is correct, but is there a better way to do this? How can I return these results from the function for further processing?