I am trying to update the values in several columns based on the corresponding values in other columns. I can do this by hard-coding each sample, but I would appreciate help automating the code below so that it works for any number of samples. Below are a minimal example input, the desired output, and the working code. Note: I am still fairly new to Python, so comments go a long way.
Input-
01_s_IDX_type 01_s_IDX 02_s_IDY_type 02_s_IDY
HET 0/1:10,9:19:99:202,0,244 HET 0/1:18,1:19:99:202,0,244
HOM 0/1:20,0:20:99:202,0,244 HOM 0/1:50,0:50:99:202,0,244
Here, the values in each sample's info column (e.g. 01_s_IDX) are used to re-value the matching _type column. The information of interest is the 3rd and 4th integers in each info value: 10,9 and 18,1 in the first row. For sample 01, the ratio 10:9 (about 1.11) is between 0.7 and 1.3, so its type stays HET. For sample 02, the ratio 18:1 is not between 0.7 and 1.3, so its type is changed to REF.
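To make the rule concrete, here is a minimal sketch that applies it to a single info string. The helper names `allele_depths` and `is_balanced` are hypothetical, the bounds are treated as inclusive, and a second depth of zero is treated as "not balanced"; all three are assumptions rather than part of the original code.

```python
def allele_depths(info):
    """Return the two depths from a string like '0/1:10,9:19:99:202,0,244'."""
    # The second colon-separated field holds the two comma-separated depths.
    first, second = info.split(":")[1].split(",")
    return int(first), int(second)


def is_balanced(info, low=0.7, high=1.3):
    """True if first/second depth lies within [low, high] (inclusive, an assumption)."""
    first, second = allele_depths(info)
    if second == 0:
        # Assumption: an undefined ratio counts as not balanced.
        return False
    return low <= first / second <= high


print(is_balanced("0/1:10,9:19:99:202,0,244"))  # True  (10/9 is about 1.11)
print(is_balanced("0/1:18,1:19:99:202,0,244"))  # False (18/1 = 18)
```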
Output-
01_s_IDX_type 01_s_IDX 02_s_IDY_type 02_s_IDY
HET 0/1:10,9:19:99:202,0,244 REF 0/1:18,1:19:99:202,0,244
HOM 0/1:20,0:20:99:202,0,244 HOM 0/1:50,0:50:99:202,0,244
Here is the code that achieved this.
import pandas as pd

#Create toy example
df = {'01_s_IDX_type': ['HET', 'HOM'],
'01_s_IDX': ['0/1:10,9:19:99:202,0,244', '0/1:20,0:20:99:202,0,244'],
'02_s_IDX_type': ['HET', 'HOM'],
'02_s_IDX': ['0/1:18,1:19:99:202,0,244', '0/1:0,50:50:99:202,0,244']
}
df = pd.DataFrame(df)
print (df)
#create new dfs for each sample
df_01, df_02 = df.filter(regex=r'^01'), df.filter(regex=r'^02')
#make copy of the info column
df_01_copy = df_01['01_s_IDX']
df_02_copy = df_02['02_s_IDX']
#remove unneeded parts of the column (first four characters)
df_01_copy = df_01_copy.str[4:]
df_02_copy = df_02_copy.str[4:]
#replace all commas with colons
df_01_copy = df_01_copy.replace(to_replace =',', value = ':', regex = True)
df_02_copy = df_02_copy.replace(to_replace =',', value = ':', regex = True)
#split into new columns by :
df_01_copy = df_01_copy.str.split(pat=':',expand=True)
df_02_copy = df_02_copy.str.split(pat=':',expand=True)
#keep first two columns
df_01_copy = df_01_copy.iloc[:,:2]
df_02_copy = df_02_copy.iloc[:,:2]
#rename columns
df_01_copy.columns = ['DP1', 'DP2']
df_02_copy.columns = ['DP1', 'DP2']
#convert to numeric, calculate ratios and add the ratios to the original dfs
df_01_copy = df_01_copy.apply(pd.to_numeric)
df_01['ratio'] = df_01_copy.DP1.div(df_01_copy.DP2)
df_02_copy = df_02_copy.apply(pd.to_numeric)
df_02['ratio'] = df_02_copy.DP1.div(df_02_copy.DP2)
#Keep HET if ratio is between 0.7 and 1.3, otherwise set REF; if ratio == 0 then HOM
df_01.loc[(df_01['ratio'] > 1.3), '01_s_IDX_type'] = 'REF'
df_01.loc[(df_01['ratio'] < 0.7), '01_s_IDX_type'] = 'REF'
df_01.loc[(df_01['ratio'] == 0), '01_s_IDX_type'] = 'HOM'
df_02.loc[(df_02['ratio'] > 1.3), '02_s_IDX_type'] = 'REF'
df_02.loc[(df_02['ratio'] < 0.7), '02_s_IDX_type'] = 'REF'
df_02.loc[(df_02['ratio'] == 0 ), '02_s_IDX_type'] = 'HOM'
#Rejoin
df_het = pd.concat([df_01, df_02], axis=1, join="outer")
df_out = df_het.drop('ratio', axis=1)
My datasets may contain any number (n) of samples, so turning this code into a pipeline or function would be ideal. Thanks in advance for any help with this.
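One possible direction, as a rough sketch rather than a definitive solution: loop over every column ending in `_type` and look up its info column by stripping that suffix. The function name `retype_samples` and its parameters are hypothetical. It also assumes that only rows currently typed HET are re-evaluated, which is what keeps the HOM rows unchanged in the desired output; this differs from the hard-coded version above, which applies the thresholds to every row.

```python
import pandas as pd


def retype_samples(df, low=0.7, high=1.3, suffix="_type"):
    """Set HET to REF wherever the depth ratio falls outside [low, high]."""
    out = df.copy()  # leave the caller's DataFrame untouched
    for type_col in [c for c in out.columns if c.endswith(suffix)]:
        info_col = type_col[: -len(suffix)]  # '01_s_IDX_type' -> '01_s_IDX'
        if info_col not in out.columns:
            continue  # no matching info column, skip this sample
        # Second colon-separated field, e.g. '10,9', split into two numbers.
        depths = (
            out[info_col]
            .str.split(":").str[1]
            .str.split(",", expand=True)
            .iloc[:, :2]
            .astype(float)
        )
        depths.columns = ["DP1", "DP2"]
        # Float division: x/0 gives inf and 0/0 gives NaN, both fail between().
        ratio = depths["DP1"] / depths["DP2"]
        is_het = out[type_col] == "HET"  # assumption: only HET rows change
        out.loc[is_het & ~ratio.between(low, high), type_col] = "REF"
    return out


# Toy input matching the Input table above.
toy = pd.DataFrame({
    "01_s_IDX_type": ["HET", "HOM"],
    "01_s_IDX": ["0/1:10,9:19:99:202,0,244", "0/1:20,0:20:99:202,0,244"],
    "02_s_IDY_type": ["HET", "HOM"],
    "02_s_IDY": ["0/1:18,1:19:99:202,0,244", "0/1:50,0:50:99:202,0,244"],
})
result = retype_samples(toy)
print(result[["01_s_IDX_type", "02_s_IDY_type"]])
```

Because the columns are paired by name rather than by a fixed prefix, the same call should work for any number of samples, provided each `_type` column has an info column with the same name minus the suffix.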