Edit: I tried the following code on my actual data, but I am getting incorrect ranges for column 1.

MAX_SIZE = 10_000_000 # max chromosome size

bins = list(range(0, MAX_SIZE, 10_000))
bins[0] = 1
labels = [f'{a}-{b}' for a,b in zip(bins, bins[1:])]

group = pd.cut(data[1], bins, labels=labels).astype(str)

out = (data.groupby([0, group, 2])
       [[3, 4, 5]].sum().reset_index()
      )

My output for column 1 should be:

1-10000

10001-20000

20001-30000

30001-40000

But I am getting:

1-10000

10000-20000

100000-110000

110000-120000

120000-130000

...

Original: I have DNA sequencing data that I have already mapped to specific sites throughout the genome, producing a CSV file with the number of times a sequence mapped to each site. I have this for several samples. What I want to do is sum each column (sample) over the rows that fall within a range of chromosome positions. In other words, I have data that looks like this:

ChrA, 553, F, 3, 0, 0, 0

ChrA, 834, F, 0, 3, 1, 0

ChrA, 987, F, 1, 2, 1, 8

...

ChrB, 348, F, 1, 1, 0, 4

...

Column 0 is the name of the chromosome, column 1 is the nucleotide position the sequence was mapped to, and columns 3-6 are the number of times that sequence was mapped at that position for 4 different samples. What I want to do is take a window of values in column 1 (for example, from 1 to 10,000) and sum each sample column over the rows that fall within that range. Then I want to sum the next segment of the chromosome, from position 10,001 to 20,000, and so on until the end of the chromosome, for each chromosome. The output should look something like this:

ChrA, 1-10000, F, 4, 5, 2, 8

ChrA, 10001-20000, F, n, n, n, n

ChrA, 20001-30000, F, n, n, n, n

...

ChrB, 1-10000, F, n, n, n, n

...

Thank you for your help!

The only thing I have tried is pandas .loc[], but I am having trouble looping through the multiple chromosomes in incremental windows.

2 Answers

I'm not sure if you are looking for something like this. Since your example data is not reproducible, I created a sample dataset based on the example you shared.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Col1': ['String' + str(i) for i in range(1, 21)],
    'Col2': range(1, 21),
    'Col3': np.random.rand(20)
})

conditions  = [df['Col2'].between(1,5), df['Col2'].between(6,10),
               df['Col2'].between(11,15), df['Col2'].between(16,20)]
choices     = [ "group1", "group2", "group3", "group4" ]
    
df["class"] = np.select(conditions, choices, default='random_group')
df["class_sum"] = df.groupby(['Col1' ,"class"])['Col3'].transform('sum')
print(df)

        Col1  Col2      Col3   class  class_sum
0    String1     1  0.509723  group1   0.509723
1    String2     2  0.387798  group1   0.387798
2    String3     3  0.106302  group1   0.106302
3    String4     4  0.576913  group1   0.576913
4    String5     5  0.068705  group1   0.068705
5    String6     6  0.802236  group2   0.802236
6    String7     7  0.511529  group2   0.511529
7    String8     8  0.846700  group2   0.846700
8    String9     9  0.785276  group2   0.785276
9   String10    10  0.912042  group2   0.912042
10  String11    11  0.607900  group3   0.607900
11  String12    12  0.842794  group3   0.842794
12  String13    13  0.779911  group3   0.779911
13  String14    14  0.964896  group3   0.964896
14  String15    15  0.983164  group3   0.983164
15  String16    16  0.753229  group4   0.753229
16  String17    17  0.739145  group4   0.739145
17  String18    18  0.915821  group4   0.915821
18  String19    19  0.338980  group4   0.338980
19  String20    20  0.698161  group4   0.698161

2 Comments

With many categories, np.select scales poorly because every condition is evaluated.
I used np.select because it gives me control over the number and distribution of bins.
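
To illustrate the point raised in these comments: when the bins all have the same width, the group can be computed arithmetically instead of listing one condition per bin. The following is only a minimal sketch under that equal-width assumption; `WIDTH` and `bin_no` are names introduced here for illustration, and it does not cover the irregular bins that np.select allows.

```python
import pandas as pd

WIDTH = 5  # assumed bin width, matching the 1-5, 6-10, ... ranges above

df = pd.DataFrame({'Col2': range(1, 21)})

# Integer division gives the bin number directly: 1-5 -> 1, 6-10 -> 2, ...
bin_no = (df['Col2'] - 1) // WIDTH + 1
df['class'] = 'group' + bin_no.astype(str)
```

This does a single vectorized pass regardless of how many bins there are, at the cost of the flexibility np.select offers.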

You can use cut and groupby.sum:

MAX_SIZE = 10_000_000 # max chromosome size

# Edges 0, 10000, 20000, ...; pd.cut bins are right-closed, so (0, 10000] covers 1-10000
bins = list(range(0, MAX_SIZE + 1, 10_000))
labels = [f'{a + 1}-{b}' for a, b in zip(bins, bins[1:])]

group = pd.cut(df['position'], bins, labels=labels).astype(str)

out = (df.groupby(['chromosome', group, 'strand'])
       [['A', 'B', 'C', 'D']].sum().reset_index()
      )

Alternatively, compute the group with floordiv:

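# First position of the window containing each row: 1, 10001, 20001, ...
start = df['position'].sub(1).floordiv(10_000).mul(10_000).add(1)
group = start.astype(str) + '-' + start.add(9_999).astype(str)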

out = (df.groupby(['chromosome', group, 'strand'])
       [['A', 'B', 'C', 'D']].sum().reset_index()
      )

Output:

  chromosome position strand  A  B  C  D
0       ChrA  1-10000      F  4  5  2  8
1       ChrB  1-10000      F  1  1  0  4

Used input:

  chromosome  position strand  A  B  C  D
0       ChrA       553      F  3  0  0  0
1       ChrA       834      F  0  3  1  0
2       ChrA       987      F  1  2  1  8
3       ChrB       348      F  1  1  0  4
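
One thing to watch with string labels: groupby sorts them alphabetically, so a label such as '100001-110000' comes before '10001-20000'. Below is a minimal end-to-end sketch that groups on the numeric window start and builds the text label only afterwards. The sample rows are made up for illustration (they are not the asker's data), and `STEP` and the temporary `start` column are names assumed here.

```python
import pandas as pd

STEP = 10_000  # window size, taken from the question

# Hypothetical rows spanning several windows; columns as in the input above.
df = pd.DataFrame({
    'chromosome': ['ChrA', 'ChrA', 'ChrA', 'ChrA', 'ChrA', 'ChrB'],
    'position':   [553, 10_000, 10_001, 25_000, 120_500, 348],
    'strand':     ['F'] * 6,
    'A': [3, 0, 1, 2, 5, 1],
    'B': [0, 3, 2, 0, 1, 1],
    'C': [0, 1, 1, 4, 0, 0],
    'D': [0, 0, 8, 1, 2, 4],
})

# First position of the window containing each row: 1, 10001, 20001, ...
start = (df['position'] - 1) // STEP * STEP + 1

# Group on the integer start so windows sort numerically within a chromosome.
out = (
    df.assign(start=start)
      .groupby(['chromosome', 'start', 'strand'], as_index=False)[['A', 'B', 'C', 'D']]
      .sum()
)

# Build the 'start-end' label after grouping, then drop the helper column.
label = out['start'].astype(str) + '-' + (out['start'] + STEP - 1).astype(str)
out.insert(1, 'position', label)
out = out.drop(columns='start')
print(out)
```

Position 10,000 lands in the 1-10000 window and 10,001 in the next one, which matches the ranges requested in the question.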
