How can I speed up processing a date column in Python?

Question

I have a DataFrame that contains the times at which customers use their tickets.

The ticket_end data is not correct, so I have to derive it from the ticket_start column, which is correct, and from the customer ticket_name column, which describes how long each ticket lasts.

I used relativedelta(months=+numberofmonths), which works, but I have 300k rows and it takes more than 2 hours. I looked for other options, but they were all just as slow. Then I tried this code again and it took only 5 minutes! I had not changed anything and I do not know what happened, but I had to restart the kernel, and now it takes more than 2 hours again.

My questions are: why did this happen, and what can I do to make processing a datetime column faster?

Here is my code:

for i in tqdm(range(len(customer))):
    if customer.ticket_name[i] == '3 month free':
        customer.ticket_end[i] = customer.ticket_start[i] + relativedelta(months=+1)
    elif customer.product_name[i] == '4 month free':
        customer.ticket_end[i] = customer.ticket_start[i] + relativedelta(months=+4)
    elif customer.product_name[i] == '6 month free':
        customer.ticket_end[i] = customer.ticket_start[i] + relativedelta(months=+6)
    elif customer.product_name[i] == '9 month free':
        customer.ticket_end[i] = customer.ticket_start[i] + relativedelta(months=+9)
    else:
        customer.ticket_end[i] = customer.ticket_start[i] + relativedelta(months=+1)

Before this code, the date columns were strings containing both date and time, e.g. '2015-01-28 17:59:50'.

I do not need the time, so I removed it with this:

customer['ticket_start']= pd.to_datetime(customer['ticket_start'],format='%Y-%m-%d %H:%M:%S')
customer['ticket_start'] = map(lambda x: x.date(), customer['ticket_start'])

Then I applied pd.to_datetime() again:

customer['ticket_start']= pd.to_datetime(customer['ticket_start'])

This might be important: I loaded the data both from a CSV file and from a database with mysql.connector, and both now take 2 hours to process.

Thanks in advance.

jezrael · Accepted Answer · 2018-02-13 12:15:48Z

You can use floor to remove the times, then create a new column with the number of months, and finally add the months with DateOffset:

import pandas as pd

# Sample data: 30 timestamps 300 hours apart, with a product name per row
rng = pd.date_range('2017-01-03  15:14:01', periods=30, freq='300H')
customer = pd.DataFrame({'ticket_start': rng, 'product_name': ['3 month free'] * 5 +
                                                              ['4 month free'] * 5 +
                                                              ['6 month free'] * 10 +
                                                              ['9 month free'] * 5 +
                                                              ['2 month free'] * 5})

# Remove the time part by flooring to whole days
customer['ticket_start'] = (pd.to_datetime(customer['ticket_start'], format='%Y-%m-%d %H:%M:%S')
                              .dt.floor('d'))

# Map each product to its number of months; unlisted products default to 1
d = {'3 month free': 1, '4 month free': 4, '6 month free': 6, '9 month free': 9}
customer['m'] = customer['product_name'].map(d).fillna(1).astype(int)

# Add the per-row number of months to ticket_start
customer['ticket_end'] = customer.apply(lambda x: x['ticket_start'] +
                                        pd.offsets.DateOffset(months=x['m']), axis=1)

print (customer)
    product_name ticket_start  m ticket_end
0   3 month free   2017-01-03  1 2017-02-03
1   3 month free   2017-01-16  1 2017-02-16
2   3 month free   2017-01-28  1 2017-02-28
3   3 month free   2017-02-10  1 2017-03-10
4   3 month free   2017-02-22  1 2017-03-22
5   4 month free   2017-03-07  4 2017-07-07
6   4 month free   2017-03-19  4 2017-07-19
7   4 month free   2017-04-01  4 2017-08-01
8   4 month free   2017-04-13  4 2017-08-13
9   4 month free   2017-04-26  4 2017-08-26
10  6 month free   2017-05-08  6 2017-11-08
11  6 month free   2017-05-21  6 2017-11-21
12  6 month free   2017-06-02  6 2017-12-02
13  6 month free   2017-06-15  6 2017-12-15
14  6 month free   2017-06-27  6 2017-12-27
15  6 month free   2017-07-10  6 2018-01-10
16  6 month free   2017-07-22  6 2018-01-22
17  6 month free   2017-08-04  6 2018-02-04
18  6 month free   2017-08-16  6 2018-02-16
19  6 month free   2017-08-29  6 2018-02-28
20  9 month free   2017-09-10  9 2018-06-10
21  9 month free   2017-09-23  9 2018-06-23
22  9 month free   2017-10-05  9 2018-07-05
23  9 month free   2017-10-18  9 2018-07-18
24  9 month free   2017-10-30  9 2018-07-30
25  2 month free   2017-11-12  1 2017-12-12
26  2 month free   2017-11-24  1 2017-12-24
27  2 month free   2017-12-07  1 2018-01-07
28  2 month free   2017-12-19  1 2018-01-19
29  2 month free   2018-01-01  1 2018-02-01
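
The row-wise apply above still calls a Python function once per row. As a possible alternative (a sketch, not part of the original answer), the offset can be added once per distinct month count, so that each addition runs on a whole group of rows. The four sample rows and the name months_by_product are made up for illustration, and the sketch assumes the DataFrame index is unique so that the result aligns back to the right rows.

```python
import pandas as pd

# Hypothetical sample data in the same shape as the question's columns
customer = pd.DataFrame({
    'ticket_start': pd.to_datetime(['2017-01-28 17:59:50',
                                    '2017-03-07 09:00:00',
                                    '2017-08-29 12:30:00',
                                    '2017-11-12 08:15:00']),
    'product_name': ['3 month free', '4 month free',
                     '6 month free', '2 month free'],
})

# Drop the time part, then map products to months (default 1, as in the answer)
customer['ticket_start'] = customer['ticket_start'].dt.floor('D')
months_by_product = {'3 month free': 1, '4 month free': 4,
                     '6 month free': 6, '9 month free': 9}
customer['m'] = customer['product_name'].map(months_by_product).fillna(1).astype(int)

# One offset addition per distinct month count instead of one per row
parts = [
    group['ticket_start'] + pd.DateOffset(months=int(n))
    for n, group in customer.groupby('m')
]

# Assignment aligns on the index, so each row receives its own result
customer['ticket_end'] = pd.concat(parts)
print(customer)
```

With only a handful of distinct ticket lengths, the number of Python-level iterations no longer depends on the row count, which is where a speed-up on 300k rows would be expected to come from.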

edited Feb 13, 2018 at 12:15

answered Feb 13, 2018 at 7:21

11 Comments

Axis Over a year ago

This is a fantastic solution, @jezrael. This code takes not even a minute on my data. Thank you. What is floor, and what is the idea behind it?

jezrael Over a year ago

I tested it more thoroughly and found a small problem: you need customer['m'] = customer['product_name'].map(d).fillna(1).astype(int) + customer['ticket_start'].dt.month instead of customer['m'] = customer['product_name'].map(d).fillna(1).astype(int)

jezrael Over a year ago

As for your second question: floor truncates datetimes, e.g. floor('d') removes the times and floor('h') removes the minutes and seconds.
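
To make the truncation concrete, here is a small sketch of flooring a datetime Series to days and to hours. The two sample timestamps and the variable names are chosen only for illustration.

```python
import pandas as pd

# Two sample timestamps (illustrative values)
s = pd.Series(pd.to_datetime(['2015-01-28 17:59:50', '2017-01-03 15:14:01']))

# Floor to whole days: the time part becomes 00:00:00
by_day = s.dt.floor('D')

# Floor to whole hours: minutes and seconds become zero
by_hour = s.dt.floor('h')

print(by_day)
print(by_hour)
```

Unlike converting each value with .date(), the result stays a datetime64 column, so no second pd.to_datetime() call is needed afterwards.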

Axis Over a year ago

Unfortunately, I got an error with your corrected code: IllegalMonthError: bad month number 13; must be 1-12. I also found that when you pass 1 to fillna(), the rows whose product you did not list in the dictionary get a ticket_end that looks random. I passed 0 instead, and now all of the unlisted rows get a ticket_end equal to their ticket_start.
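
The effect of the fillna() argument can be seen on the mapping step alone. The sketch below uses the dictionary from the answer and three made-up product names, one of which is not in the dictionary; default_one and default_zero are illustrative names.

```python
import pandas as pd

d = {'3 month free': 1, '4 month free': 4, '6 month free': 6, '9 month free': 9}

# '2 month free' is not a key of d, so map() yields NaN for it
products = pd.Series(['3 month free', '9 month free', '2 month free'])

# fillna() decides how many months the unlisted products receive
default_one = products.map(d).fillna(1).astype(int)
default_zero = products.map(d).fillna(0).astype(int)

print(default_one.tolist())
print(default_zero.tolist())
```

With a default of 0 months, the unlisted rows add nothing to ticket_start, which matches the behaviour described in the comment above.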

Axis Over a year ago

Amazing! Thanks for all @jezrael
