I have a pandas DataFrame with MultiIndex columns that looks like the following:
import pandas as pd
import numpy as np
genotype_data = [
['0/1', '120,60', 180, 5, 0.5, '0/1', '200,2', 202, 99, 0.01],
['0/1', '200,20', 60, 99, 0.1, '0/1', '200,50', 250, 99, 0.4],
['0/1', '200,2', 202, 99, 0.01, '0/1', '200,2', 202, 99, 0.01]
]
genotype_columns = [['Sample1', 'Sample2'], ['GT', 'AD', 'DP', 'GQ', 'AB']]
cols = pd.MultiIndex.from_product(genotype_columns)
genotype = pd.DataFrame(data=genotype_data, columns=cols)
info_columns = [['INFO'], ['AC', 'DEPTH']]
cols = pd.MultiIndex.from_product(info_columns)
info = pd.DataFrame(data=[[12, 100], [23, 200], [40, 40]], columns=cols)
df = pd.concat([info, genotype], axis=1)
I want to filter df so that it keeps only the rows where at least one of the samples (Sample1 or Sample2 in this case) has both DP >= 50 and GQ < 4. Under these conditions, all rows except the first should be filtered out.
I have no idea where to start with this and would appreciate some help.
EDIT:
I arrived at a solution thanks to the help of jezrael's post. The code is as follows:
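# Skip the two INFO columns (.ix has been removed from pandas; use .iloc)
genotype = df.iloc[:, 2:]
# One column per sample for each field
DP = genotype.xs('DP', axis=1, level=1)
GQ = genotype.xs('GQ', axis=1, level=1)
# True for rows where any single sample has DP >= 50 and GQ < 4
conditions = (DP.ge(50) & GQ.lt(4)).any(axis=1)
df = df[conditions]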
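For anyone who wants this as a reusable function, here is a minimal sketch of the same idea. The function name `filter_by_sample_quality` and the parameters `dp_min` and `gq_max` are my own placeholders, not anything from pandas or from jezrael's post. It assumes the field names (DP, GQ) are on the second column level, as in the example above.

```python
import pandas as pd


def filter_by_sample_quality(df, dp_min=50, gq_max=4):
    """Keep rows where at least one sample has DP >= dp_min and GQ < gq_max.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Cross-sections on column level 1 return one column per sample.
    # The INFO block has no DP or GQ entry, so it drops out automatically.
    dp = df.xs('DP', axis=1, level=1)
    gq = df.xs('GQ', axis=1, level=1)
    # Both conditions must hold for the same sample; any(axis=1) then
    # checks whether at least one sample in the row satisfies them.
    mask = (dp.ge(dp_min) & gq.lt(gq_max)).any(axis=1)
    return df[mask]


# Rebuild the example frame from the question.
genotype_data = [
    ['0/1', '120,60', 180, 5, 0.5, '0/1', '200,2', 202, 99, 0.01],
    ['0/1', '200,20', 60, 99, 0.1, '0/1', '200,50', 250, 99, 0.4],
    ['0/1', '200,2', 202, 99, 0.01, '0/1', '200,2', 202, 99, 0.01],
]
genotype_cols = pd.MultiIndex.from_product(
    [['Sample1', 'Sample2'], ['GT', 'AD', 'DP', 'GQ', 'AB']])
info_cols = pd.MultiIndex.from_product([['INFO'], ['AC', 'DEPTH']])
genotype = pd.DataFrame(genotype_data, columns=genotype_cols)
info = pd.DataFrame([[12, 100], [23, 200], [40, 40]], columns=info_cols)
df = pd.concat([info, genotype], axis=1)

strict = filter_by_sample_quality(df)             # GQ < 4
relaxed = filter_by_sample_quality(df, gq_max=6)  # GQ < 6
```

One thing to watch: in the sample data as posted, Sample1 in the first row has GQ = 5, so a cutoff of GQ < 4 keeps no rows at all, while `gq_max=6` keeps only the first row.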