I have a pandas DataFrame with a column of amino acid alterations, populated with values such as these:

df['Alteration']

Q79K,E17K
Q79K,E17K
T315I

I would like to convert each single-letter amino acid code to its three-letter code, so that the column looks like this:

Gln79Lys,Glu17Lys
Gln79Lys,Glu17Lys
Thr315Ile

So far I have tried a dictionary with compiled regular expressions as the keys:

AA_code = {re.compile('[C]'): 'Cys',re.compile('[D]'): 'Asp', 
re.compile('[S]'): 'Ser',re.compile('[Q]'): 'Gln',re.compile('[K]'): 'Lys', 
re.compile('[I]'): 'Ile',re.compile('[P]'): 'Pro',re.compile('[T]'): 'Thr', 
re.compile('[F]'): 'Phe',re.compile('[N]'): 'Asn',re.compile('[G]'): 'Gly', 
re.compile('[H]'): 'His',re.compile('[L]'): 'Leu',re.compile('[R]'): 'Arg', 
re.compile('[W]'): 'Trp',re.compile('[A]'): 'Ala',re.compile('[V]'): 'Val', 
re.compile('[E]'): 'Glu',re.compile('[Y]'): 'Tyr',re.compile('[M]'): 'Met'}

And the following code to replace based on the dictionary:

df['Replacement'] = df['Alteration'].replace(AA_code, regex=True)

However, I am getting some strange behaviour where replace overwrites values it has already substituted, producing this:

Glyln79Leuys,Glu17Leuys
Glyln79Leuys,Glu17Leuys
Thr315Ile

From what I understand, Glyln appears because the code first changes Q to Gln, and then the G in Gln is itself replaced by the 'G': 'Gly' pair in the dictionary, giving Glyln. Likewise, the L in Lys is replaced by Leu, giving Leuys. Is there some way to fix this?
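
The cascade described above can be reproduced outside pandas. The following is a minimal sketch, assuming the replacements are applied one after another in dictionary order (which is what the output suggests). The `pairs` list and the `sequential_replace` helper are hypothetical names used only for illustration, and only a subset of the mapping is included.

```python
# Hypothetical subset of the mapping, in the same relative order
# as the dictionary in the question.
pairs = [('Q', 'Gln'), ('K', 'Lys'), ('G', 'Gly'), ('L', 'Leu'), ('E', 'Glu')]

def sequential_replace(text, pairs):
    """Apply each replacement in turn to the result of the previous one."""
    for old, new in pairs:
        text = text.replace(old, new)
    return text

result = sequential_replace('Q79K,E17K', pairs)
print(result)  # Glyln79Leuys,Glu17Leuys
```

Because each pass sees the output of the previous one, the G introduced by Gln and the L introduced by Lys are picked up by later replacements.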

Thank you!!

2 Answers

Make a single lookup table and then use it in a callable passed to Series.str.replace, so that every letter is matched against the original string in one pass, e.g.:

import pandas as pd

lookup = {
    'Q': 'Gln',
    'K': 'Lys',
    'E': 'Glu',
    'G': 'Gly'
    # needs completing...
}

s = pd.Series(['Q79K,E17K', 'Q79K,E17K', 'T315I'])
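# regex=True is required for a callable replacement (the default is regex=False in pandas 2.0+)
s.str.replace('([{}])'.format(''.join(lookup)), lambda m: lookup[m.group(1)], regex=True)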

With the partial table above (T and I are not in it yet), this gives you:

0    Gln79Lys,Glu17Lys
1    Gln79Lys,Glu17Lys
2                T315I
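
For reference, here is a sketch of the same single-pass idea with the table completed for the twenty standard amino acids, written with the standard-library `re` module so it runs without pandas. The names `AA_THREE` and `to_three_letter` are hypothetical, chosen for this example only.

```python
import re

# One-letter to three-letter codes for the twenty standard amino acids.
AA_THREE = {
    'A': 'Ala', 'R': 'Arg', 'N': 'Asn', 'D': 'Asp', 'C': 'Cys',
    'Q': 'Gln', 'E': 'Glu', 'G': 'Gly', 'H': 'His', 'I': 'Ile',
    'L': 'Leu', 'K': 'Lys', 'M': 'Met', 'F': 'Phe', 'P': 'Pro',
    'S': 'Ser', 'T': 'Thr', 'W': 'Trp', 'Y': 'Tyr', 'V': 'Val',
}

# A single character class covering every key, compiled once.
AA_PATTERN = re.compile('[{}]'.format(''.join(AA_THREE)))

def to_three_letter(alteration):
    """Replace every one-letter code in a single pass over the input."""
    return AA_PATTERN.sub(lambda m: AA_THREE[m.group(0)], alteration)

print(to_three_letter('Q79K,E17K'))  # Gln79Lys,Glu17Lys
print(to_three_letter('T315I'))      # Thr315Ile
```

Since the pattern is matched only against the original text, letters introduced by a replacement are never replaced again. On a DataFrame, something like `df['Alteration'].map(to_three_letter)` should apply it to each value, assuming the column contains only strings.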

Jon's answer is great. Building on it, another way is to walk through each string character by character and look each one up:

import pandas as pd

lookup = {
    'Q': 'Gln',
    'K': 'Lys',
    'E': 'Glu',
    'G': 'Gly'
     # needs completing...
}

s = pd.Series(['Q79K,E17K', 'Q79K,E17K', 'T315I'])
s.apply(lambda row: "".join([lookup[x] if x in lookup else x for x in row]))

or, as suggested by @Jon Clements in the comment,

s.apply(lambda row: "".join([lookup.get(x,x) for x in row]))

With the partial table above, either version gives you:

0    Gln79Lys,Glu17Lys
1    Gln79Lys,Glu17Lys
2                T315I
dtype: object
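
A closely related variant, offered as a sketch rather than as part of the answer above, is to build a translation table with `str.maketrans` and let `str.translate` do the per-character lookup. It uses the same partial `lookup` table; the names `table`, `values` and `converted` are hypothetical.

```python
lookup = {
    'Q': 'Gln',
    'K': 'Lys',
    'E': 'Glu',
    'G': 'Gly',
    # needs completing...
}

# str.maketrans accepts a dict whose keys are single characters;
# the values may be strings of any length.
table = str.maketrans(lookup)

values = ['Q79K,E17K', 'Q79K,E17K', 'T315I']
converted = [v.translate(table) for v in values]
print(converted)  # ['Gln79Lys,Glu17Lys', 'Gln79Lys,Glu17Lys', 'T315I']
```

Characters that are not in the table are left unchanged, which matches the behaviour of `lookup.get(x, x)`. On a Series, `s.str.translate(table)` should give the equivalent result.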

2 Comments

If you're going to go for that approach then lookup.get(x, x) for x in row is a way of avoiding the explicit if/else check...
Amazing. I was looking up how to avoid the if/else check myself. Thanks!
