I have a pandas DataFrame with a column populated with values such as these:
df['Alteration']
Q79K,E17K
Q79K,E17K
T315I
I would like to convert each single-letter amino acid code to its three-letter code, so that the column looks like this:
Gln79Lys,Glu17Lys
Gln79Lys,Glu17Lys
Thr315Ile
So far I have tried a dictionary with compiled regular expressions as the keys:
import re

AA_code = {re.compile('[C]'): 'Cys', re.compile('[D]'): 'Asp',
           re.compile('[S]'): 'Ser', re.compile('[Q]'): 'Gln', re.compile('[K]'): 'Lys',
           re.compile('[I]'): 'Ile', re.compile('[P]'): 'Pro', re.compile('[T]'): 'Thr',
           re.compile('[F]'): 'Phe', re.compile('[N]'): 'Asn', re.compile('[G]'): 'Gly',
           re.compile('[H]'): 'His', re.compile('[L]'): 'Leu', re.compile('[R]'): 'Arg',
           re.compile('[W]'): 'Trp', re.compile('[A]'): 'Ala', re.compile('[V]'): 'Val',
           re.compile('[E]'): 'Glu', re.compile('[Y]'): 'Tyr', re.compile('[M]'): 'Met'}
I then use the following code to replace values based on the dictionary:
df['Replacement'] = df['Alteration'].replace(AA_code, regex=True)
However, I am getting strange behaviour: replace appears to overwrite its own output, so the result looks like this:
Glyln79Leuys,Glu17Leuys
Glyln79Leuys,Glu17Leuys
Thr315Ile
From what I understand, Glyln arises because Q is first changed to Gln, and the G in Gln is then matched by the G: Gly entry in the dictionary. Likewise, K becomes Lys and its L is then replaced by Leu, giving Leuys. Is there a way to make each replacement apply only to the original letters?
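To illustrate the mechanism I suspect, here is a minimal sketch in plain Python, without pandas. It assumes the dictionary entries are applied one after another, which is consistent with the output above but which I have not confirmed in the pandas source. The names AA_MAP, AA_PATTERN, sequential_replace and to_three_letter are made up for this example. The second function shows one possible single-pass alternative: a single pattern with a lookup function as the replacement.

```python
import re

# Single-letter to three-letter codes (same pairs as AA_code, in the same order).
AA_MAP = {
    'C': 'Cys', 'D': 'Asp', 'S': 'Ser', 'Q': 'Gln', 'K': 'Lys',
    'I': 'Ile', 'P': 'Pro', 'T': 'Thr', 'F': 'Phe', 'N': 'Asn',
    'G': 'Gly', 'H': 'His', 'L': 'Leu', 'R': 'Arg', 'W': 'Trp',
    'A': 'Ala', 'V': 'Val', 'E': 'Glu', 'Y': 'Tyr', 'M': 'Met',
}

def sequential_replace(text):
    # Assumption: mimics applying the entries one at a time, so a later key
    # can match a capital letter inserted by an earlier replacement.
    for letter, code in AA_MAP.items():
        text = re.sub(letter, code, text)
    return text

# One character class that matches any of the single-letter codes.
AA_PATTERN = re.compile('[' + ''.join(AA_MAP) + ']')

def to_three_letter(text):
    # Single pass: each match comes from the original string, and the
    # inserted three-letter code is never scanned again.
    return AA_PATTERN.sub(lambda m: AA_MAP[m.group(0)], text)

values = ['Q79K,E17K', 'T315I']
cascaded = [sequential_replace(v) for v in values]
converted = [to_three_letter(v) for v in values]
print(cascaded)
print(converted)
```

If this reasoning is right, something like df['Alteration'].map(to_three_letter) might give the desired column, though I have only checked the function on plain strings.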
Thank you!!