Filtering a MultiIndex Pandas DataFrame

Question

I have a MultiIndex Pandas DataFrame that looks like the following:

import pandas as pd
import numpy as np

genotype_data = [
                    ['0/1', '120,60', 180, 5, 0.5, '0/1', '200,2', 202, 99, 0.01],
                    ['0/1', '200,20', 60, 99, 0.1, '0/1', '200,50', 250, 99, 0.4],
                    ['0/1', '200,2', 202, 99, 0.01, '0/1', '200,2', 202, 99, 0.01] 
]


genotype_columns = [['Sample1', 'Sample2'], ['GT', 'AD', 'DP', 'GQ', 'AB']]
cols = pd.MultiIndex.from_product(genotype_columns)
genotype = pd.DataFrame(data=genotype_data, columns=cols)

info_columns = [['INFO'], ['AC', 'DEPTH']]
cols = pd.MultiIndex.from_product(info_columns)
info = pd.DataFrame(data=[[12, 100], [23, 200], [40, 40]], columns=cols)

df = pd.concat([info, genotype], axis=1)

I want to filter the df for any rows where at least one of the Samples (Sample1 or Sample2 in this case) has a DP >= 50 & GQ < 4. Under these conditions all rows should be filtered out except the first row.

I have no idea where to start with this and would appreciate some help.

EDIT:

I arrived at a solution thanks to the help of jezrael's post. The code is as follows:

genotype = df.ix[:,3:]

DP = genotype.xs('DP', axis=1, level=1)
GQ = genotype.xs('GQ', axis=1, level=1)

conditions = (DP.ge(50) & GQ.le(4)).T.any() 
df = df[conditions]

return df

jezrael · Accepted Answer · 2017-03-31 17:07:41Z

4

I think you can use:

filter Samples columns
select xs DP and HQ subsets and compare with ge and lt
logical and (&) and get at least one True by any
get first True by idxmax and select by loc

#data in sample change for matching (first 99 in HQ in Sample1 was changed to 3)

genotype_data = [
                    ['0/1', '120,60', 180, 5, 0.5, '0/1', '200,2', 202, 99, 0.01],
                    ['0/1', '200,20', 60, 3, 0.1, '0/1', '200,50', 250, 99, 0.4],
                    ['0/1', '200,2', 202, 99, 0.01, '0/1', '200,2', 202, 99, 0.01] 
]


genotype_columns = [['Sample1', 'Sample2'], ['GT', 'AD', 'DP', 'GQ', 'AB']]
cols = pd.MultiIndex.from_product(genotype_columns)
genotype = pd.DataFrame(data=genotype_data, columns=cols)

info_columns = [['INFO'], ['AC', 'DEPTH']]
cols = pd.MultiIndex.from_product(info_columns)
info = pd.DataFrame(data=[[12, 100], [23, 200], [40, 40]], columns=cols)

df = pd.concat([info, genotype], axis=1)
print (df)
  INFO       Sample1                        Sample2                       
    AC DEPTH      GT      AD   DP  GQ    AB      GT      AD   DP  GQ    AB
0   12   100     0/1  120,60  180   5  0.50     0/1   200,2  202  99  0.01
1   23   200     0/1  200,20   60   3  0.10     0/1  200,50  250  99  0.40
2   40    40     0/1   200,2  202  99  0.01     0/1   200,2  202  99  0.01

df1 = df.filter(like='Sample')
df = df.loc[[(df1.xs('DP', axis=1, level=1).ge(50) & 
              df1.xs('GQ', axis=1, level=1).lt(4)).any(1).idxmax()]]
print (df)
  INFO       Sample1                     Sample2                      
    AC DEPTH      GT      AD  DP GQ   AB      GT      AD   DP  GQ   AB
1   23   200     0/1  200,20  60  3  0.1     0/1  200,50  250  99  0.4

EDIT:

If need return all rows by condition, remove loc and idmax:

df1 = df.filter(like='Sample')
#changed condition to lt(10) (<10)
df = df[(df1.xs('DP', axis=1, level=1).ge(50) & df1.xs('GQ', axis=1, level=1).lt(10)).any(1)]
print (df)
  INFO       Sample1                      Sample2                       
    AC DEPTH      GT      AD   DP GQ   AB      GT      AD   DP  GQ    AB
0   12   100     0/1  120,60  180  5  0.5     0/1   200,2  202  99  0.01
1   23   200     0/1  200,20   60  3  0.1     0/1  200,50  250  99  0.40

edited Mar 31, 2017 at 17:07

answered Mar 31, 2017 at 10:19

jezrael

868k103 gold badges1.4k silver badges1.3k bronze badges

Sign up to request clarification or add additional context in comments.

7 Comments

piRSquared Over a year ago

I'm trying something weird and I may never get to it as I'm pretty sleepy... so ...

David Ross Over a year ago

This doesn't work so well. If I change the conditions to DP > 0 (as below) I would expect all three columns to be returned, however, I only get the first index back (index 0). df = df.loc[[(df1.xs('DP', axis=1, level=1).ge(0)).any(1).idxmax()]]

jezrael Over a year ago

Hmm, idxmax() id for first column only, I think you can remove it if need all rows whee condition is True.

David Ross Over a year ago

okay. I managed to get it by playing around with what you gave me. The working code is now in edited question. Thanks for your help! :-)

jezrael Over a year ago

Super, our idea is same, only I changed T.any() to any(1)

|

piRSquared · Accepted Answer · 2017-03-31 10:41:26Z

stack the first level and use query to identify indices

df.loc[df.stack(0).query('DP >= 50 & GQ < 4').unstack().index]

  INFO       Sample1                     Sample2                      
    AC DEPTH      GT      AD  DP GQ   AB      GT      AD   DP  GQ   AB
1   23   200     0/1  200,20  60  3  0.1     0/1  200,50  250  99  0.4

I used @jezrael's setup

genotype_data = [
                    ['0/1', '120,60', 180, 5, 0.5, '0/1', '200,2', 202, 99, 0.01],
                    ['0/1', '200,20', 60, 3, 0.1, '0/1', '200,50', 250, 99, 0.4],
                    ['0/1', '200,2', 202, 99, 0.01, '0/1', '200,2', 202, 99, 0.01] 
]


genotype_columns = [['Sample1', 'Sample2'], ['GT', 'AD', 'DP', 'GQ', 'AB']]
cols = pd.MultiIndex.from_product(genotype_columns)
genotype = pd.DataFrame(data=genotype_data, columns=cols)

info_columns = [['INFO'], ['AC', 'DEPTH']]
cols = pd.MultiIndex.from_product(info_columns)
info = pd.DataFrame(data=[[12, 100], [23, 200], [40, 40]], columns=cols)

df = pd.concat([info, genotype], axis=1)

Collectives™ on Stack Overflow

Filtering a MultiIndex Pandas DataFrame

2 Answers 2

7 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

7 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related