I have a DataFrame with some user input (it's supposed to just be a plain email address), along with some other values, like this:

import pandas as pd
from pandas import Series, DataFrame

df = pd.DataFrame({'input': ['Captain Jean-Luc Picard <[email protected]>','[email protected]','[email protected]','William Riker <[email protected]>'],'val_1':[1.5,3.6,2.4,2.9],'val_2':[7.3,-2.5,3.4,1.5]})

Due to a bug, the input sometimes has the user's name as well as brackets around the email address; this needs to be fixed before continuing with the analysis.

To move forward, I want to create a new column that has cleaned versions of the emails: if the email contains names/brackets then remove those, else just give the already correct email.

There are numerous examples of cleaning string data with Python/pandas, but I've yet to successfully implement any of these suggestions. Here are a few examples of what I've tried:

# as noted in pandas docs, turns all non-matching strings into NaN
df['cleaned'] = df['input'].str.extract('<(.*)>')

# AttributeError: type object 'str' has no attribute 'contains'
df['cleaned'] = df['input'].apply(lambda x: str.extract('<(.*)>') if str.contains('<(.*)>') else x)

# AttributeError: 'DataFrame' object has no attribute 'str'
df['cleaned'] = df[df['input'].str.contains('<(.*)>')].str.extract('<(.*)>')

Thanks!

2 Answers

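Use np.where to apply str.extract only to the rows that contain an embedded email, and return the 'input' value unchanged for the rest. On pandas 0.23 and later, pass expand=False so that str.extract returns a Series rather than a one-column DataFrame:

In [63]:

import numpy as np

df['cleaned'] = np.where(df['input'].str.contains('<'), df['input'].str.extract('<(.*)>', expand=False), df['input'])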

df

Out[63]:
                                            input  val_1  val_2  \
0  Captain Jean-Luc Picard <[email protected]>    1.5    7.3   
1                       [email protected]    3.6   -2.5   
2                              [email protected]    2.4    3.4   
3             William Riker <[email protected]>    2.9    1.5   

                     cleaned  
0       [email protected]  
1  [email protected]  
2         [email protected]  
3        [email protected]  
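
For reference, here is a self-contained sketch of the same idea that can be run as-is. The addresses are hypothetical placeholders (the real ones are redacted above), and the fillna variant is an alternative I'm assuming is acceptable here, since str.extract already yields NaN for rows with no brackets:

```python
import numpy as np
import pandas as pd

# Hypothetical addresses; the originals are redacted in the post.
df = pd.DataFrame({
    'input': [
        'Captain Jean-Luc Picard <picard@example.com>',
        'deanna.troi@example.com',
        'data@example.com',
        'William Riker <riker@example.com>',
    ],
    'val_1': [1.5, 3.6, 2.4, 2.9],
    'val_2': [7.3, -2.5, 3.4, 1.5],
})

# expand=False makes str.extract return a Series for a single capture group.
extracted = df['input'].str.extract(r'<(.*)>', expand=False)

# Rows without brackets come back as NaN, so fall back to the raw input there.
df['cleaned'] = extracted.fillna(df['input'])

# The same result via np.where, as in the answer above.
via_where = np.where(df['input'].str.contains('<'), extracted, df['input'])
```

Both routes should give the same column; fillna just avoids scanning the strings a second time with str.contains.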

If you want to use regular expressions:

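import re

# Capture whatever sits between the angle brackets
rex = re.compile(r'<(.*)>')

def fix(s):
    # Return the bracketed address if present, otherwise the string unchanged
    m = rex.search(s)
    if m is None:
        return s
    return m.group(1)

fixed = df['input'].apply(fix)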
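
As a quick check of this approach, here is a minimal standalone sketch. The sample strings are hypothetical (the real addresses are redacted in the question), and the str.replace line is a vectorized variant I'm adding for comparison, not part of the code above:

```python
import re

import pandas as pd

# Hypothetical inputs; the real addresses are redacted in the question.
inputs = pd.Series([
    'Captain Jean-Luc Picard <picard@example.com>',
    'data@example.com',
])

rex = re.compile(r'<(.*)>')

def fix(s):
    # Bracketed part if there is one, otherwise the original string
    m = rex.search(s)
    return s if m is None else m.group(1)

applied = inputs.apply(fix)

# Vectorized variant: rewrite the whole string to its bracketed part.
# Strings without brackets do not match the pattern and pass through unchanged.
replaced = inputs.str.replace(r'^.*<(.*)>.*$', r'\1', regex=True)
```

Note that fix assumes every value is a string; a missing value (NaN) in the column would make rex.search raise a TypeError, so those rows would need handling first.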
