3

I have a MultiIndex Pandas DataFrame that looks like the following:

import pandas as pd
import numpy as np

genotype_data = [
                    ['0/1', '120,60', 180, 5, 0.5, '0/1', '200,2', 202, 99, 0.01],
                    ['0/1', '200,20', 60, 99, 0.1, '0/1', '200,50', 250, 99, 0.4],
                    ['0/1', '200,2', 202, 99, 0.01, '0/1', '200,2', 202, 99, 0.01] 
]


genotype_columns = [['Sample1', 'Sample2'], ['GT', 'AD', 'DP', 'GQ', 'AB']]
cols = pd.MultiIndex.from_product(genotype_columns)
genotype = pd.DataFrame(data=genotype_data, columns=cols)

info_columns = [['INFO'], ['AC', 'DEPTH']]
cols = pd.MultiIndex.from_product(info_columns)
info = pd.DataFrame(data=[[12, 100], [23, 200], [40, 40]], columns=cols)

df = pd.concat([info, genotype], axis=1)

I want to filter the df for any rows where at least one of the Samples (Sample1 or Sample2 in this case) has a DP >= 50 & GQ < 4. Under these conditions all rows should be filtered out except the first row.

I have no idea where to start with this and would appreciate some help.

EDIT:

I arrived at a solution thanks to the help of jezrael's post. The code is as follows:

genotype = df.ix[:,3:]

DP = genotype.xs('DP', axis=1, level=1)
GQ = genotype.xs('GQ', axis=1, level=1)

conditions = (DP.ge(50) & GQ.le(4)).T.any() 
df = df[conditions]

return df
0

2 Answers 2

4

I think you can use:

  • filter Samples columns
  • select xs DP and HQ subsets and compare with ge and lt
  • logical and (&) and get at least one True by any
  • get first True by idxmax and select by loc
#data in sample change for matching (first 99 in HQ in Sample1 was changed to 3)

genotype_data = [
                    ['0/1', '120,60', 180, 5, 0.5, '0/1', '200,2', 202, 99, 0.01],
                    ['0/1', '200,20', 60, 3, 0.1, '0/1', '200,50', 250, 99, 0.4],
                    ['0/1', '200,2', 202, 99, 0.01, '0/1', '200,2', 202, 99, 0.01] 
]


genotype_columns = [['Sample1', 'Sample2'], ['GT', 'AD', 'DP', 'GQ', 'AB']]
cols = pd.MultiIndex.from_product(genotype_columns)
genotype = pd.DataFrame(data=genotype_data, columns=cols)

info_columns = [['INFO'], ['AC', 'DEPTH']]
cols = pd.MultiIndex.from_product(info_columns)
info = pd.DataFrame(data=[[12, 100], [23, 200], [40, 40]], columns=cols)

df = pd.concat([info, genotype], axis=1)
print (df)
  INFO       Sample1                        Sample2                       
    AC DEPTH      GT      AD   DP  GQ    AB      GT      AD   DP  GQ    AB
0   12   100     0/1  120,60  180   5  0.50     0/1   200,2  202  99  0.01
1   23   200     0/1  200,20   60   3  0.10     0/1  200,50  250  99  0.40
2   40    40     0/1   200,2  202  99  0.01     0/1   200,2  202  99  0.01
df1 = df.filter(like='Sample')
df = df.loc[[(df1.xs('DP', axis=1, level=1).ge(50) & 
              df1.xs('GQ', axis=1, level=1).lt(4)).any(1).idxmax()]]
print (df)
  INFO       Sample1                     Sample2                      
    AC DEPTH      GT      AD  DP GQ   AB      GT      AD   DP  GQ   AB
1   23   200     0/1  200,20  60  3  0.1     0/1  200,50  250  99  0.4

EDIT:

If need return all rows by condition, remove loc and idmax:

df1 = df.filter(like='Sample')
#changed condition to lt(10) (<10)
df = df[(df1.xs('DP', axis=1, level=1).ge(50) & df1.xs('GQ', axis=1, level=1).lt(10)).any(1)]
print (df)
  INFO       Sample1                      Sample2                       
    AC DEPTH      GT      AD   DP GQ   AB      GT      AD   DP  GQ    AB
0   12   100     0/1  120,60  180  5  0.5     0/1   200,2  202  99  0.01
1   23   200     0/1  200,20   60  3  0.1     0/1  200,50  250  99  0.40
Sign up to request clarification or add additional context in comments.

7 Comments

I'm trying something weird and I may never get to it as I'm pretty sleepy... so ...
This doesn't work so well. If I change the conditions to DP > 0 (as below) I would expect all three columns to be returned, however, I only get the first index back (index 0). df = df.loc[[(df1.xs('DP', axis=1, level=1).ge(0)).any(1).idxmax()]]
Hmm, idxmax() id for first column only, I think you can remove it if need all rows whee condition is True.
okay. I managed to get it by playing around with what you gave me. The working code is now in edited question. Thanks for your help! :-)
Super, our idea is same, only I changed T.any() to any(1)
|
1

stack the first level and use query to identify indices

df.loc[df.stack(0).query('DP >= 50 & GQ < 4').unstack().index]

  INFO       Sample1                     Sample2                      
    AC DEPTH      GT      AD  DP GQ   AB      GT      AD   DP  GQ   AB
1   23   200     0/1  200,20  60  3  0.1     0/1  200,50  250  99  0.4

I used @jezrael's setup

genotype_data = [
                    ['0/1', '120,60', 180, 5, 0.5, '0/1', '200,2', 202, 99, 0.01],
                    ['0/1', '200,20', 60, 3, 0.1, '0/1', '200,50', 250, 99, 0.4],
                    ['0/1', '200,2', 202, 99, 0.01, '0/1', '200,2', 202, 99, 0.01] 
]


genotype_columns = [['Sample1', 'Sample2'], ['GT', 'AD', 'DP', 'GQ', 'AB']]
cols = pd.MultiIndex.from_product(genotype_columns)
genotype = pd.DataFrame(data=genotype_data, columns=cols)

info_columns = [['INFO'], ['AC', 'DEPTH']]
cols = pd.MultiIndex.from_product(info_columns)
info = pd.DataFrame(data=[[12, 100], [23, 200], [40, 40]], columns=cols)

df = pd.concat([info, genotype], axis=1)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.