Replacing strings in pandas data frame using dictionary without overwriting

Question

I am trying to convert a pandas data frame with a column populated with values such as this:

df['Alteration']

Q79K,E17K
Q79K,E17K
T315I

I would like to convert each single-letter amino acid code to its three-letter code, so that the column looks like this:

Gln79Lys,Glu17Lys
Gln79Lys,Glu17Lys
Thr315Ile

So far I have tried a dictionary with compiled regular expressions as the keys, such as this:

import re

AA_code = {
    re.compile('[C]'): 'Cys', re.compile('[D]'): 'Asp', re.compile('[S]'): 'Ser',
    re.compile('[Q]'): 'Gln', re.compile('[K]'): 'Lys', re.compile('[I]'): 'Ile',
    re.compile('[P]'): 'Pro', re.compile('[T]'): 'Thr', re.compile('[F]'): 'Phe',
    re.compile('[N]'): 'Asn', re.compile('[G]'): 'Gly', re.compile('[H]'): 'His',
    re.compile('[L]'): 'Leu', re.compile('[R]'): 'Arg', re.compile('[W]'): 'Trp',
    re.compile('[A]'): 'Ala', re.compile('[V]'): 'Val', re.compile('[E]'): 'Glu',
    re.compile('[Y]'): 'Tyr', re.compile('[M]'): 'Met',
}

And the following code to replace based on the dictionary:

df['Replacement'] = df['Alteration'].replace(AA_code, regex=True)

However, I am getting some strange behaviour where the replace function overwrites its own output, so the values look like this:

Glyln79Leuys,Glu17Leuys
Glyln79Leuys,Glu17Leuys
Thr315Ile

From what I understand, Glyln comes from the code first changing the Q to Gln, after which the G in Gln is itself replaced by the G : Gly key : value pair in the dictionary, giving Glyln. Is there some way to fix this?

Thank you!!
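The cascade described in the question can be reproduced without pandas. The sketch below assumes, as the question suggests, that the dictionary entries are applied one after another; `steps` and `sequential_replace` are hypothetical names, and `steps` is only a five-entry stand-in that keeps the same relative order as the question's dictionary.

```python
# Hypothetical five-entry stand-in for the question's dictionary,
# keeping the same relative order (Q and K come before G and L).
steps = {"Q": "Gln", "K": "Lys", "G": "Gly", "L": "Leu", "E": "Glu"}


def sequential_replace(text, table):
    """Apply each replacement in turn, so later keys also see earlier output."""
    for old, new in table.items():
        text = text.replace(old, new)
    return text


result = sequential_replace("Q79K,E17K", steps)
print(result)  # Glyln79Leuys,Glu17Leuys
```

The "G" introduced by "Gln" and the "L" introduced by "Lys" are picked up by the later G and L steps, which matches the output shown in the question.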

Jon Clements · Accepted Answer · 2018-08-19 13:13:29Z

1

Make a single lookup table and then use it in a callable passed to Series.str.replace, so every letter is replaced in one pass, e.g.:

import pandas as pd

lookup = {
    'Q': 'Gln',
    'K': 'Lys',
    'E': 'Glu',
    'G': 'Gly'
    # needs completing...
}

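s = pd.Series(['Q79K,E17K', 'Q79K,E17K', 'T315I'])
# Build one character class from the keys and look each match up in the table.
# regex=True is needed for a callable replacement (the default is False since pandas 2.0).
s.str.replace('([{}])'.format(''.join(lookup)), lambda m: lookup[m.group(1)], regex=True)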

Gives you:

0    Gln79Lys,Glu17Lys
1    Gln79Lys,Glu17Lys
2                T315I
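The last row stays as T315I only because the lookup above is incomplete. As a minimal sketch of the completed table, the block below fills in the 20 standard amino acids (the same set as the question's dictionary) and applies the same single-pass idea with the standard library's `re` module. `AA_CODES`, `AA_PATTERN` and `to_three_letter` are names chosen here for illustration.

```python
import re

# One-letter to three-letter codes for the 20 standard amino acids.
AA_CODES = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}

# A single character class built from every key in the table.
AA_PATTERN = re.compile("[{}]".format("".join(AA_CODES)))


def to_three_letter(alteration):
    """Replace every one-letter code in a single pass over the string."""
    return AA_PATTERN.sub(lambda m: AA_CODES[m.group(0)], alteration)


converted = [to_three_letter(a) for a in ["Q79K,E17K", "T315I"]]
print(converted)  # ['Gln79Lys,Glu17Lys', 'Thr315Ile']
```

Because the pattern scans the original string once, inserted text such as "Gln" is never re-examined. On a pandas column, something like `df['Alteration'].map(to_three_letter)` should apply the same function row by row.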

edited Aug 19, 2018 at 13:13

answered Aug 19, 2018 at 12:13

Raunaq Jain · Accepted Answer · 2018-08-19 14:52:52Z

0

Jon's answer is great. Building on it, another way is to map each character of the string through the lookup and join the results:

import pandas as pd

lookup = {
    'Q': 'Gln',
    'K': 'Lys',
    'E': 'Glu',
    'G': 'Gly'
    # needs completing...
}

s = pd.Series(['Q79K,E17K', 'Q79K,E17K', 'T315I'])
s.apply(lambda row: "".join([lookup[x] if x in lookup else x for x in row]))

or, as suggested by @Jon Clements in the comment,

s.apply(lambda row: "".join([lookup.get(x,x) for x in row]))

which gives you,

0    Gln79Lys,Glu17Lys
1    Gln79Lys,Glu17Lys
2                T315I
dtype: object
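A related option, not used in either answer, is Python's built-in `str.translate`, which also maps one character at a time and accepts multi-character replacements. This is a minimal sketch with the same partial lookup; `table`, `values` and `translated` are illustrative names.

```python
# Same partial lookup as above; it still needs completing.
lookup = {"Q": "Gln", "K": "Lys", "E": "Glu", "G": "Gly"}

# str.maketrans accepts a dict whose keys are single characters;
# the values may be strings of any length.
table = str.maketrans(lookup)

values = ["Q79K,E17K", "Q79K,E17K", "T315I"]
translated = [v.translate(table) for v in values]
print(translated)  # ['Gln79Lys,Glu17Lys', 'Gln79Lys,Glu17Lys', 'T315I']
```

Characters missing from the table pass through unchanged, which is why T315I is untouched here. pandas exposes the same operation as `Series.str.translate(table)`, so `s.str.translate(table)` should give the equivalent Series.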

edited Aug 19, 2018 at 14:52

answered Aug 19, 2018 at 13:26

2 Comments

Jon Clements Over a year ago

If you're going to go for that approach then lookup.get(x, x) for x in row is a way of avoiding the explicit if/else check...

Raunaq Jain Over a year ago

Amazing. I was looking up how to avoid the if/else check myself. Thanks!
