I have a DataFrame that contains customer ticket usage times.

The ticket_end data is not correct, so I have to compute it from the ticket_start column, which is correct, and the ticket_name column, which describes how long each ticket lasts.

I used relativedelta(months=+numberofmonths), which works, but I have 300k rows and it takes more than 2 hours. I started looking for other options, but they were all just as slow. Then I tried this code again and it took only 5 minutes! I had not changed anything, so I do not know what happened. I then had to restart the kernel, and now it is taking more than 2 hours again.

My questions are: why did this happen, and what can be done to make processing a datetime column faster?

Here is my code:

for i in tqdm(range(len(customer))):
    if customer.ticket_name[i] == '3 month free':
        customer.ticket_end[i] = customer.ticket_start[i] + relativedelta(months=+1)

    elif customer.product_name[i] == '4 month free':
        customer.ticket_end[i] = customer.ticket_start[i] + relativedelta(months=+4)

    elif customer.product_name[i] == '6 month free':
        customer.ticket_end[i] = customer.ticket_start[i] + relativedelta(months=+6)

    elif customer.product_name[i] == '9 month free':
        customer.ticket_end[i] = customer.ticket_start[i] + relativedelta(months=+9)

    else:
        customer.ticket_end[i] = customer.ticket_start[i] + relativedelta(months=+1)

Before running this code, the date columns were strings holding both date and time, e.g. '2015-01-28 17:59:50'.

I did not need the time, so I removed it with this:

customer['ticket_start']= pd.to_datetime(customer['ticket_start'],format='%Y-%m-%d %H:%M:%S')
customer['ticket_start'] = map(lambda x: x.date(), customer['ticket_start'])

Then I applied pd.to_datetime() again:

customer['ticket_start']= pd.to_datetime(customer['ticket_start'])

This might be critical information: I got the data both from a CSV file and from a database via mysql.connector, and now both take 2 hours to process.

Thanks in advance.

1 Answer

You can remove the times with floor, then create a new column for the number of months, and finally add them with DateOffset:

rng = pd.date_range('2017-01-03  15:14:01', periods=30, freq='300H')
customer = pd.DataFrame({'ticket_start': rng, 'product_name': ['3 month free'] * 5 + 
                                                              ['4 month free'] * 5 + 
                                                              ['6 month free'] * 10 +
                                                              ['9 month free'] * 5 +
                                                              ['2 month free'] * 5} )  


#print (customer)

customer['ticket_start']=(pd.to_datetime(customer['ticket_start'],format='%Y-%m-%d %H:%M:%S')
                            .dt.floor('d'))
d = {'3 month free' : 1, '4 month free': 4, '6 month free':6, '9 month free':9}
customer['m'] = customer['product_name'].map(d).fillna(1).astype(int) 


customer['ticket_end'] = customer.apply(lambda x: x['ticket_start'] + 
                                    pd.offsets.DateOffset(months=x['m']), axis=1)

print (customer)
    product_name ticket_start  m ticket_end
0   3 month free   2017-01-03  1 2017-02-03
1   3 month free   2017-01-16  1 2017-02-16
2   3 month free   2017-01-28  1 2017-02-28
3   3 month free   2017-02-10  1 2017-03-10
4   3 month free   2017-02-22  1 2017-03-22
5   4 month free   2017-03-07  4 2017-07-07
6   4 month free   2017-03-19  4 2017-07-19
7   4 month free   2017-04-01  4 2017-08-01
8   4 month free   2017-04-13  4 2017-08-13
9   4 month free   2017-04-26  4 2017-08-26
10  6 month free   2017-05-08  6 2017-11-08
11  6 month free   2017-05-21  6 2017-11-21
12  6 month free   2017-06-02  6 2017-12-02
13  6 month free   2017-06-15  6 2017-12-15
14  6 month free   2017-06-27  6 2017-12-27
15  6 month free   2017-07-10  6 2018-01-10
16  6 month free   2017-07-22  6 2018-01-22
17  6 month free   2017-08-04  6 2018-02-04
18  6 month free   2017-08-16  6 2018-02-16
19  6 month free   2017-08-29  6 2018-02-28
20  9 month free   2017-09-10  9 2018-06-10
21  9 month free   2017-09-23  9 2018-06-23
22  9 month free   2017-10-05  9 2018-07-05
23  9 month free   2017-10-18  9 2018-07-18
24  9 month free   2017-10-30  9 2018-07-30
25  2 month free   2017-11-12  1 2017-12-12
26  2 month free   2017-11-24  1 2017-12-24
27  2 month free   2017-12-07  1 2018-01-07
28  2 month free   2017-12-19  1 2018-01-19
29  2 month free   2018-01-01  1 2018-02-01
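
A possible refinement of the code above: apply(..., axis=1) still builds one DateOffset per row in Python. Because there are only a few distinct month counts, a sketch like the following adds one offset per group of rows instead. The sample data and the name `parts` are illustrative assumptions, not taken from the question, and this has not been timed on 300k rows.

```python
import pandas as pd

# Small illustrative frame; column names follow the answer above.
customer = pd.DataFrame({
    'ticket_start': pd.to_datetime(['2017-01-03 15:14:01',
                                    '2017-03-07 08:00:00',
                                    '2017-08-29 23:59:59',
                                    '2017-11-12 12:00:00']),
    'product_name': ['3 month free', '4 month free',
                     '6 month free', '2 month free'],
})

# Same mapping as in the answer; unlisted products fall back to 1 month.
d = {'3 month free': 1, '4 month free': 4, '6 month free': 6, '9 month free': 9}
customer['ticket_start'] = customer['ticket_start'].dt.floor('D')
customer['m'] = customer['product_name'].map(d).fillna(1).astype(int)

# One vectorized addition per distinct month count instead of one per row.
parts = [
    group['ticket_start'] + pd.DateOffset(months=int(months))
    for months, group in customer.groupby('m')
]

# pd.concat keeps the original index labels, so the assignment aligns by index.
customer['ticket_end'] = pd.concat(parts)

print(customer)
```

With four distinct durations this performs four Series additions regardless of the number of rows, which is the reason to expect it to scale better than a per-row apply.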
Sign up to request clarification or add additional context in comments.

11 Comments

This is a fantastic solution, @jezrael. This code takes not even a minute on my data. Thank you. What is floor, and what is the idea behind it?
I tested it more thoroughly and found a small problem: you need customer['m'] = customer['product_name'].map(d).fillna(1).astype(int) + customer['ticket_start'].dt.month instead of customer['m'] = customer['product_name'].map(d).fillna(1).astype(int)
And for your second question: floor truncates datetimes, e.g. .dt.floor('d') removes the times and .dt.floor('h') removes the minutes.
Unfortunately, I got an error with your corrected code: IllegalMonthError: bad month number 13; must be 1-12. Also, I noticed that when you pass 1 to fillna(), the rows whose product names are not listed in the dictionary get an unexpected ticket_end. I passed 0 instead, so for all unlisted products ticket_end stays the same as ticket_start.
Amazing! Thanks for everything, @jezrael.
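
To illustrate the floor explanation from the comments, here is a minimal sketch on a two-element Series. The sample timestamps and the names `s`, `daily` and `hourly` are assumptions made for the example only.

```python
import pandas as pd

# Two sample timestamps; the first matches the format shown in the question.
s = pd.Series(pd.to_datetime(['2015-01-28 17:59:50', '2015-01-29 00:00:01']))

# Round down to the start of the day: the time part becomes 00:00:00,
# and the dtype stays datetime64 (no conversion to Python date objects).
daily = s.dt.floor('D')

# Round down to the start of the hour: minutes and seconds are dropped.
hourly = s.dt.floor('h')

print(daily)
print(hourly)
```

Because the result is still a datetime64 column, no second pd.to_datetime() call is needed after removing the time this way.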