Python: Count instances of a specific character in all rows within a dataframe column

Question

I have a dataframe (df) containing columns ['toaddress', 'ccaddress', 'body']

I want to iterate through the rows of the dataframe to get the min, max, and average number of email addresses in the toaddress and ccaddress fields, determined by counting the instances of '@' within each field in those two columns.

If all else fails, I guess I could just use df.toaddress.str.contains(r'@').sum() and divide that by the number of rows in the dataframe to get the average, but I think that just counts the rows that have at least one '@' sign.

It won't let me post the image of the rows :( but the first column is an unlabeled index starting at 0 and going to over 400k rows. The toaddress column has email addresses separated by commas and sometimes \null
– bluechips, Commented Aug 21, 2015 at 19:32
Note that you suggest using df.toaddress.str.contains(r'@').sum(). Why not use df.toaddress.str.count(r'@') if you're happy going column by column? I added an answer that does it across more than one column in one step.
– ely, Commented Aug 21, 2015 at 19:55
@Mr. F - great point, and that's what worked best. Thank you!!
– bluechips, Commented Aug 21, 2015 at 20:33

ely · Accepted Answer · 2015-08-21 19:57:04Z

3

You can use

df[['toaddress', 'ccaddress']].applymap(lambda x: str.count(x, '@'))

to get back the count of '@' within each cell.

Then you can compute the pandas max, min, and mean of the result along the row axis (axis=0, the default), which gives one value per column.
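As a minimal sketch of the two steps above, the snippet below counts '@' per cell and then aggregates per column. The sample frame is hypothetical, and it swaps the cell-wise str.count call for the column-wise .str.count accessor, since plain str.count raises a TypeError on non-string cells such as missing values. Treating a missing cell as zero addresses is an assumption, not something the question states.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's frame.
df = pd.DataFrame({
    "toaddress": ["a@example.com, b@example.com", "c@example.com", None],
    "ccaddress": ["d@example.com", None, "e@example.com, f@example.com, g@example.com"],
    "body": ["hi", "hello", "hey"],
})

# Count '@' per cell. The .str accessor leaves missing cells as missing,
# which are treated here as zero addresses (an assumption).
counts = (
    df[["toaddress", "ccaddress"]]
    .apply(lambda col: col.str.count("@"))
    .fillna(0)
)

# One row per statistic, one column per address field.
summary = counts.agg(["min", "max", "mean"])
print(summary)
```

If missing cells should be excluded from the average instead, dropping the fillna(0) call would let pandas skip them when aggregating.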

As I commented on the original question, you already suggested using df.toaddress.str.contains(r'@').sum() -- why not use df.toaddress.str.count(r'@') if you're happy going column by column instead of the method I showed above?
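To illustrate the difference between the two calls, here is a small sketch on a made-up Series (the values are assumptions for illustration only). str.contains marks whether a row has at least one '@', while str.count returns how many it has.

```python
import pandas as pd

# Hypothetical column with two, one, and zero addresses.
s = pd.Series(["a@example.com, b@example.com", "c@example.com", "none"])

# Number of rows that contain at least one '@'.
rows_with_at = s.str.contains("@").sum()

# Number of '@' characters in each row.
per_row = s.str.count("@")

print(rows_with_at)
print(per_row.min(), per_row.max(), per_row.mean())
```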

edited Aug 21, 2015 at 19:57

answered Aug 21, 2015 at 19:51

1 Comment

bluechips Over a year ago

Thank you. To answer your question I didn't think of count(r'@')...good point. Then I can append it with .max .mean .min Excellent Point! It works. I'll also study the applymap method - thanks!

Dmitry Rubanovich · Answer · 2015-08-21 19:42:49Z

0

len(list(filter(lambda row: '@' in row.toaddress, rows)))

or even, if some values may not be strings,

len(list(filter(lambda row: '@' in str(row.toaddress), rows)))

edited Aug 21, 2015 at 19:42

answered Aug 21, 2015 at 19:37


2 Comments

bluechips Over a year ago

How does the "rows" come into play?

Dmitry Rubanovich Over a year ago

rows would be whatever you use to iterate over the rows of your dataframe. The main point was to use filter on each row and then use len() to count the ones that the filter lets through.
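One possible way to supply rows, sketched here under the assumption that it comes from DataFrame.itertuples (the answer does not say which iterator was intended), with hypothetical sample data. Note that this counts the rows containing an '@', not the number of addresses.

```python
import pandas as pd

# Hypothetical sample data; the None stands in for a missing address field.
df = pd.DataFrame(
    {"toaddress": ["a@example.com, b@example.com", None, "c@example.com"]}
)

# Assumption: `rows` is an iterator of named tuples, one per dataframe row.
rows = df.itertuples(index=False)

# str() guards against non-string cells; filter returns an iterator in
# Python 3, so it is materialized with list() before calling len().
matching = len(list(filter(lambda row: "@" in str(row.toaddress), rows)))
print(matching)
```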

Joseph Stover · Answer · 2015-08-21 19:47:02Z

0

Perhaps something like this

import re
from pandas import DataFrame

df = DataFrame({"emails": ["[email protected], [email protected]",
                           "[email protected], none, [email protected], [email protected]"]})

# Compile the pattern once; '@' has no case, so no flags are needed.
at = re.compile(r"@")

def count_emails(string):
    # Count every '@' found in the cell.
    return len(at.findall(string))

# Apply the counter to each cell of the column.
df["count"] = df["emails"].map(count_emails)

df

Returns:

    emails                                                  count
0   "[email protected], [email protected]"                     2
1   "[email protected], none, [email protected], Th..."     3

answered Aug 21, 2015 at 19:47


1 Comment

bluechips Over a year ago

Thanks Joseph. Though another answer was a bit more concise, this helps me understand how I can address another problem I'm having where the amount of combinations of fields is a small fixed number. Thank you!

memebrain · Answer · 2015-08-21 20:45:08Z

0

This answer uses https://pypi.python.org/pypi/fake-factory to generate the test data

import pandas as pd
from random import randint
from faker import Factory

fake = Factory.create()

def emails():
    # Return a list of one to four fake email addresses.
    return [fake.email() for _ in range(randint(1, 4))]

# Build all rows first, then create the frame once
# (DataFrame.append was removed in pandas 2.0).
rows = [{'toaddress': emails(), 'ccaddress': emails(), 'body': fake.text()}
        for _ in range(10)]
df1 = pd.DataFrame(rows, columns=['toaddress', 'ccaddress', 'body'])

# Each cell holds a list of addresses, so len() gives the count per row.
print('toaddress length is {}'.format([len(x) for x in df1.toaddress.values]))
print('ccaddress length is {}'.format([len(x) for x in df1.ccaddress.values]))

The last 2 lines are the part that counts your emails. I wasn't sure if you wanted to check for '@' specifically. Maybe you can use fake-factory to generate some test data as an example?
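When each cell already holds a list of addresses, as in the generated data above, the per-row lengths can feed the min, max, and mean the question asks for. This is only a sketch: the frame below is hand-written sample data (an assumption, so that it runs without faker), not output from the code above.

```python
import pandas as pd

# Hypothetical frame whose cells are lists of addresses.
df1 = pd.DataFrame({
    "toaddress": [["a@example.com"],
                  ["b@example.com", "c@example.com", "d@example.com"]],
    "ccaddress": [["e@example.com", "f@example.com"],
                  ["g@example.com"]],
})

# Length of each list, computed cell by cell for both columns.
lengths = df1[["toaddress", "ccaddress"]].apply(lambda col: col.map(len))

# One row per statistic, one column per address field.
stats = lengths.agg(["min", "max", "mean"])
print(stats)
```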

answered Aug 21, 2015 at 20:45


1 Comment

bluechips Over a year ago

Thank you! Mr. F helped me realize an elegant solution, but thanks for turning me onto fake-factory. I didn't know that existed. This python newbie is drinking from a fire hose. Thank you memebrain!
