
I have a dataframe (df) containing the columns ['toaddress', 'ccaddress', 'body'].

I want to go through the rows of the dataframe and get the min, max, and average number of email addresses in the toaddress and ccaddress fields, determined by counting the instances of '@' within each field in those two columns.

If all else fails, I guess I could just use df.toaddress.str.contains(r'@').sum() and divide that by the number of rows in the dataframe to get the average, but I think that only counts the rows that have at least one '@' sign.

4
  • Can you provide a few rows of the data frame? Commented Aug 21, 2015 at 18:58
  • It won't let me post the image of the rows :( but the first column is an unlabeled index starting at 0 and going to over 400k rows. The column toaddress has email addresses separated by commas and sometimes \null Commented Aug 21, 2015 at 19:32
  • You suggest using df.toaddress.str.contains(r'@').sum(); why not use df.toaddress.str.count(r'@') if you're happy going column by column? I added an answer that does it across more than one column in one step. Commented Aug 21, 2015 at 19:55
  • @Mr. F - great point and that's what worked best. Thank you!! Commented Aug 21, 2015 at 20:33

4 Answers

3

You can use

df[['toaddress', 'ccaddress']].applymap(lambda x: str(x).count('@'))

to get back the count of '@' within each cell.

Then you can just compute the pandas max, min, and mean along the row axis in the result.
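A minimal sketch of the whole computation, assuming made-up sample data and using a column-wise Series.str.count instead of applymap (a swapped-in technique, chosen because missing cells then come back as NaN and can be filled with 0). The names cols, counts, and summary are illustrative, not from the question.

```python
import pandas as pd

# Hypothetical sample data; the addresses are invented for illustration.
df = pd.DataFrame({
    "toaddress": ["a@example.com, b@example.com", "c@example.com", None],
    "ccaddress": ["d@example.com", None,
                  "e@example.com, f@example.com, g@example.com"],
    "body": ["hi", "hello", "hey"],
})

cols = ["toaddress", "ccaddress"]

# Count '@' in every cell; missing cells give NaN, so treat them as 0 addresses.
counts = df[cols].apply(lambda col: col.str.count("@")).fillna(0).astype(int)

# One row per statistic, one column per address field.
summary = counts.agg(["min", "max", "mean"])
print(summary)
```

Here counts holds one integer per cell, and summary has the rows min, max, and mean for each of the two address columns.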

As I commented on the original question, you already suggested using df.toaddress.str.contains(r'@').sum() -- why not use df.toaddress.str.count(r'@') if you're happy going column by column instead of the method I showed above?
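To make the difference concrete, here is a small sketch on an invented Series (the values and variable names are assumptions for illustration): str.contains gives one boolean per row, while str.count gives the number of matches per row.

```python
import pandas as pd

# Hypothetical column with 2, 1, and 0 addresses per row.
to = pd.Series([
    "a@example.com, b@example.com",
    "c@example.com",
    "no address here",
])

# contains() is True/False per row, so the sum is the number of rows with any '@'.
rows_with_at = to.str.contains("@").sum()

# count() is the number of '@' per row, so the sum is the total number of '@'.
per_row = to.str.count("@")
total_ats = per_row.sum()

print(rows_with_at, total_ats)
print(per_row.min(), per_row.max(), per_row.mean())
```

Dividing rows_with_at by the number of rows would give the share of rows that have an address at all, whereas per_row.mean() is the average number of addresses per row.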


1 Comment

Thank you. To answer your question, I didn't think of count(r'@'). Good point. Then I can chain .max(), .mean(), and .min() onto it. It works. I'll also study the applymap method. Thanks!
0
len(list(filter(lambda row: '@' in str(row.toaddress), rows)))

or even

sum('@' in str(row.toaddress) for row in rows)

2 Comments

How does the "rows" come into play?
rows would be whatever you use to iterate over the rows of your output. The main point was to use filter on each row and then get the count of the ones that the filter lets through with len().
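One possible reading of "rows", sketched under the assumption that it comes from DataFrame.itertuples (the answer does not say which iterator was meant, and the sample data is invented):

```python
import pandas as pd

# Hypothetical data: one address, a missing value, and two addresses.
df = pd.DataFrame({
    "toaddress": ["a@example.com", None, "b@example.com, c@example.com"],
})

# Assumption: iterate with itertuples, which yields one namedtuple per row.
rows = df.itertuples(index=False)

# Keep only the rows whose toaddress contains '@', then count them.
rows_with_at = len(list(filter(lambda row: "@" in str(row.toaddress), rows)))
print(rows_with_at)
```

This counts rows that contain at least one '@', not the number of addresses, so it answers a narrower question than the min, max, and mean asked for.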
0

Perhaps something like this

import re

import pandas as pd

df = pd.DataFrame({"emails": ["[email protected], [email protected]",
                              "[email protected], none, [email protected], [email protected]"]})

# Compile the pattern once; each '@' is treated as one address
at = re.compile(r"@")

def count_emails(string):
    # Number of non-overlapping '@' matches in the cell
    return len(at.findall(string))

df["count"] = df["emails"].map(count_emails)

df

Returns:

    emails                                                  count
0   "[email protected], [email protected]"                     2
1   "[email protected], none, [email protected], Th..."     3

1 Comment

Thanks Joseph. Though another answer was a bit more concise, this helps me understand how I can address another problem I'm having where the amount of combinations of fields is a small fixed number. Thank you!
0

This answer uses https://pypi.python.org/pypi/fake-factory to generate the test data

import pandas as pd
from random import randint
from faker import Factory
fake = Factory.create()

def emails():
    # Each cell holds a list of one to four fake addresses
    emailAdd = [fake.email()]
    for x in range(randint(0,3)):
        emailAdd.append(fake.email())

    return emailAdd

# Collect the rows first and build the frame once (DataFrame.append was removed in pandas 2.0)
rows = []
for extra in range(10):
    rows.append({'toaddress': emails(), 'ccaddress': emails(), 'body': fake.text()})

df1 = pd.DataFrame(rows, columns=['toaddress', 'ccaddress', 'body'])

print('toaddress length is {}'.format([len(x) for x in df1.toaddress.values]))
print('ccaddress length is {}'.format([len(x) for x in df1.ccaddress.values]))

The last 2 lines are the part that counts your emails. I wasn't sure if you wanted to check for '@' specifically; maybe you can use fake-factory to generate some test data as an example?

1 Comment

Thank you! Mr. F helped me realize an elegant solution, but thanks for turning me onto fake-factory. I didn't know that existed. This python newbie is drinking from a fire hose. Thank you memebrain!
