
I have the following code, where df is a pandas DataFrame that is 159538 rows x 3 columns:

dfs = []
for i in df['email_address']:
    data = df[df['email_address'] == i]
    data['difference'] = data['ts_placed'].diff().astype('timedelta64[D]')
    repeat = []
    for a in data['difference']:
        if a > 10:
            repeat.append(0)
        elif a <= 10:
            repeat.append(1)
        else:
            repeat.append(0)
    data['repeat'] = repeat
    dfs.append(data)

This code runs extremely slowly. I would like to speed up the process by using multiprocessing. This SO question shows how to do it in R. What is the equivalent code for Python?
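For context, this is a minimal sketch of the kind of multiprocessing approach I have in mind (the function names, the 10-day threshold, and the worker count are placeholders of my own, not tested at scale):

```python
import pandas as pd
from multiprocessing import Pool

def add_repeat(group):
    # Per-address worker: gap in whole days between consecutive orders,
    # then flag 1 where the gap is 10 days or less (NaT compares False -> 0)
    group = group.copy()
    group['difference'] = group['ts_placed'].diff().dt.days
    group['repeat'] = (group['difference'] <= 10).astype(int)
    return group

def parallel_repeat(df, n_workers=4):
    # Split into one sub-frame per email address, process groups in parallel,
    # then stitch the results back together
    groups = [g for _, g in df.groupby('email_address')]
    with Pool(n_workers) as pool:
        results = pool.map(add_repeat, groups)
    return pd.concat(results)
```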

this is a sample of the data after running:

df['difference'] = df.groupby('email_address')['ts_placed'].diff()



df
Out[6]: 
                           email_address           ts_placed       difference
0                  [email protected] 2015-08-06 00:00:34              NaT
1              [email protected] 2015-08-06 00:05:38              NaT
2        [email protected] 2015-08-06 00:09:20              NaT
3                    [email protected] 2015-08-06 00:10:01              NaT
4       terry.wfdfdfdfdfy-holdings.co.uk 2015-08-06 00:14:00              NaT
5              [email protected] 2015-08-06 00:14:00              NaT
6                 [email protected] 2015-08-06 00:14:00              NaT
7           [email protected] 2015-08-06 00:14:20              NaT
8                   [email protected] 2015-08-06 00:14:43              NaT
9               [email protected] 2015-08-06 00:17:03              NaT
10             [email protected] 2015-08-06 00:17:58              NaT
...
22                   [email protected] 2015-08-06 00:46:12  0 days 00:04:15
  • It looks like all you're doing is grouping on email addresses, then calculating diff on those groups, and then assigning a count if the diff is larger than 10, correct? Also, if this is pandas then tag it so. Commented Nov 9, 2015 at 10:50
  • Can you post raw input data and code to reproduce your df, to avoid any ambiguities? Commented Nov 9, 2015 at 10:56

1 Answer


IIUC then you can do the following:

df['difference'] = df.groupby('email_address')['ts_placed'].diff()

df['repeat'] = df.groupby('email_address')['difference'].transform(lambda x: (x < 10).cumcount())

5 Comments

Can you post your data, code, and the error? I've edited it.
This is the error I get: AttributeError: 'Series' object has no attribute 'cumcount'. I cannot post data due to confidentiality. However, the email_address column is a string and the ts_placed column is a datetime64[ns]; both have no missing values. The difference column is a timedelta64[ns] and has several NaT values.
How about: df.groupby('email_address')['difference'].transform(lambda x: x < 10).cumcount()
Can you not just anonymise the data and post a small snippet? It's difficult to post answers and interpret your errors without data and code.
OK, can you try this: df['repeat'] = df.groupby('email_address')['difference'].transform(lambda x: (x.dt.seconds < 10).sum())
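For completeness, the diff-plus-flag logic from the question can be written fully vectorized, with no transform or cumcount at all, since groupby diff already aligns with the original index. A sketch, assuming the 10 in the question means 10 days, as in the original loop:

```python
import pandas as pd

def flag_repeats(df):
    # Gap in whole days between consecutive orders per email address;
    # diff() returns NaT for each address's first order
    out = df.copy()
    out['difference'] = out.groupby('email_address')['ts_placed'].diff().dt.days
    # 1 where the gap is 10 days or less; NaN compares False, so first orders get 0,
    # matching the loop's else branch
    out['repeat'] = (out['difference'] <= 10).astype(int)
    return out
```

This reproduces the per-address loop in two vectorized operations, which should be far faster than iterating row by row even before any multiprocessing is added.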
