Edit: I tried the following code on my actual data, but I am getting incorrect ranges in column 1.
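import pandas as pd

MAX_SIZE = 10_000_000  # max chromosome size
bins = list(range(0, MAX_SIZE, 10_000))  # bin edges every 10,000 positions
bins[0] = 1  # make the first label start at position 1
labels = [f'{a}-{b}' for a, b in zip(bins, bins[1:])]
group = pd.cut(data[1], bins, labels=labels).astype(str)
# sum the sample columns per chromosome, window, and column 2
out = (data.groupby([0, group, 2])[[3, 4, 5]]
           .sum()
           .reset_index())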
My output for column 1 should be:
1-10000
10001-20000
20001-30000
30001-40000
But I am getting:
1-10000
10000-20000
100000-110000
110000-120000
120000-130000
...
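Two things in the snippet above appear to explain this output. The labels are built from the shared bin edges, so every window after the first starts on the previous window's end (10000-20000 rather than 10001-20000), and `.astype(str)` turns the groups into plain strings, so `groupby` sorts them alphabetically and "100000-110000" lands before "20000-30000". A minimal sketch of one possible fix follows, assuming right-closed bins and keeping the result categorical; the `positions` values are made up for illustration.

```python
import pandas as pd

WINDOW = 10_000
MAX_SIZE = 10_000_000  # max chromosome size, as in the snippet above

# Right-closed edges 0, 10000, 20000, ...: the interval (0, 10000] covers positions 1..10000.
# The range is extended by one window so the last edge is MAX_SIZE itself.
bins = list(range(0, MAX_SIZE + WINDOW, WINDOW))

# Label each window by its first and last position: "1-10000", "10001-20000", ...
labels = [f"{a + 1}-{b}" for a, b in zip(bins, bins[1:])]

# Hypothetical positions, chosen to hit a window boundary and a six-digit window.
positions = pd.Series([553, 10_000, 10_001, 105_000])

# No .astype(str): the result stays categorical, so it sorts in window order.
group = pd.cut(positions, bins, labels=labels)

print(group.sort_values().astype(str).tolist())  # window order
print(sorted(group.astype(str)))                 # alphabetical order, for comparison
```

If a categorical like this is passed to `groupby`, passing `observed=True` as well should keep the empty windows out of the result.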
Original: I have DNA sequencing data that I have already mapped to specific sites throughout the genome, which gave me a CSV file with the number of times a sequence mapped to each site. I have this for several samples, and what I want to do is sum each column (sample) over the rows that fall within a range of chromosome positions. In other words, I have data that looks like this:
ChrA, 553, F, 3, 0, 0, 0
ChrA, 834, F, 0, 3, 1, 0
ChrA, 987, F, 1, 2, 1, 8
...
ChrB, 348, F, 1, 1, 0, 4
...
Column 0 is the name of the chromosome, column 1 is the nucleotide position the sequence was mapped to, and columns 3-6 are the number of times that sequence was mapped at that position in 4 different samples. What I want to do is take a window of values in column 1 (for example, from 1 to 10,000) and sum each sample column over all rows that fall within that range. Then I want to sum the next segment of the chromosome, from position 10,001 to 20,000, and so on until the end of the chromosome, for each chromosome. The output should look something like this:
ChrA, 1-10000, F, 4, 5, 2, 8
ChrA, 10001-20000, F, n, n, n, n
ChrA, 20001-30000, F, n, n, n, n
...
ChrB, 1-10000, F, n, n, n, n
...
Thank you for your help!
The only thing I have tried is the pandas .loc[] indexer, but I am having trouble looping through the multiple chromosomes in incremental windows.
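For reference, the windowing described above can probably be done without an explicit loop, by deriving a window index from the position with integer division and grouping on it. The following is only a sketch under stated assumptions: the columns are labeled 0-6 as in the sample data, the window width is 10,000, and the row at position 10001 is hypothetical, added so that a second window appears.

```python
import pandas as pd

WINDOW = 10_000  # window width in nucleotides (assumed)

# Rows in the same layout as the sample data above.
data = pd.DataFrame([
    ["ChrA", 553, "F", 3, 0, 0, 0],
    ["ChrA", 834, "F", 0, 3, 1, 0],
    ["ChrA", 987, "F", 1, 2, 1, 8],
    ["ChrA", 10_001, "F", 2, 0, 0, 1],  # hypothetical row, not from the real data
    ["ChrB", 348, "F", 1, 1, 0, 4],
])

# 0-based window index: positions 1..10000 -> 0, 10001..20000 -> 1, and so on.
window = (data[1] - 1) // WINDOW

# Group by chromosome, window index, and column 2, then sum the four sample columns.
# The index is numeric, so the windows come out in positional order.
out = data.groupby([0, window, 2])[[3, 4, 5, 6]].sum().reset_index()

# Replace the numeric window index with a "start-end" label.
start = out[1] * WINDOW + 1
out[1] = start.astype(str) + "-" + (start + WINDOW - 1).astype(str)

print(out.to_string(index=False, header=False))
```

Grouping on the numeric index and building the labels only at the end avoids sorting the windows as strings, and only windows that actually contain data appear in the result.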