python pandas: vectorized function value error "lengths do not match"

Question

I have a pandas dataframe:

'customer_id','transaction_dt','product','price','units'
1,2004-01-02 00:00:00,thing1,25,47
1,2004-01-17 00:00:00,thing2,150,8
2,2004-01-29 00:00:00,thing2,150,25
3,2017-07-15 00:00:00,thing3,55,17
3,2016-05-12 00:00:00,thing3,55,47
4,2012-02-23 00:00:00,thing2,150,22
4,2009-10-10 00:00:00,thing1,25,12
4,2014-04-04 00:00:00,thing2,150,2
5,2008-07-09 00:00:00,thing2,150,43
5,2004-01-30 00:00:00,thing1,25,40
5,2004-01-31 00:00:00,thing1,25,22
5,2004-02-01 00:00:00,thing1,25,2

And I have written the following to apply date range columns:

import numpy as np
import pandas as pd

start_date_period = pd.period_range('2004-01-01', '12-31-2017', freq='30D')
end_date_period = pd.period_range('2004-01-30', '12-31-2017', freq='30D')

def find_window_start_date(x):
    window_start_date_idx = np.argmax(x < start_date_period.end_time)
    return start_date_period[window_start_date_idx]

df['window_start_dt'] = df['transaction_dt'].apply(find_window_start_date)

def find_window_end_date(x):
    window_end_date_idx = np.argmin(x > end_date_period.start_time)
    return end_date_period[window_end_date_idx]

df['window_end_dt'] = df['transaction_dt'].apply(find_window_end_date)

However this is very slow, so I have been trying to vectorize it:

import numpy as np
import pandas as pd

start_date_period = pd.period_range('2004-01-01', '12-31-2017', freq='30D')
end_date_period = pd.period_range('2004-01-30', '12-31-2017', freq='30D')

def find_window_start_date(x):
    window_start_date_idx = np.argmax(x < start_date_period.end_time)
    return start_date_period[window_start_date_idx]

df['window_start_dt'] = find_window_start_date(df['transaction_dt'].values)

def find_window_end_date(x):
    window_end_date_idx = np.argmin(x > end_date_period.start_time)
    return end_date_period[window_end_date_idx]

df['window_end_dt'] = find_window_end_date(df['transaction_dt'].values)

However this produces a value error: "Lengths must match to compare". I am new to trying to write vectorized functions from scratch, so I would appreciate any insight into where I am going awry.

EDIT:

Here is the full error message:

ValueError                                Traceback (most recent call last)
<ipython-input-11-a781075467c5> in <module>()
      5     return start_date_period[window_start_date_idx]
      6 
----> 7 df['window_start_dt'] = find_window_start_date(df['transaction_dt'].values)
      8 
      9 def find_window_end_date(x):

<ipython-input-11-a781075467c5> in find_window_start_date(x)
      2 
      3 def find_window_start_date(x):
----> 4     window_start_date_idx = np.argmax(x < start_date_period.end_time)
      5     return start_date_period[window_start_date_idx]
      6 

C:\Users\AppData\Local\Continuum\anaconda2\lib\site-packages\pandas\core\ops.pyc in wrapper(self, other, axis)
    826             if (not is_scalar(lib.item_from_zerodim(other)) and
    827                     len(self) != len(other)):
--> 828                 raise ValueError('Lengths must match to compare')
    829 
    830             if isinstance(other, ABCPeriodIndex):

ValueError: Lengths must match to compare

EDIT:

I ended up finding an edge condition with the original solution when there are collisions on the first/last days of a 30 day window. I have made some changes to get closer to a robust solution now:

start_date_range = pd.date_range('2004-01-01 00:00:00', '12-31-2017 00:00:00', freq='30D')
end_date_range = pd.date_range('2004-01-30 23:59:59', '12-31-2017 23:59:59', freq='30D')

tra = df['transaction_dt'].values[:, None]
idx1 = np.argmax(start_date_range.values < tra, axis=1)
idx2 = np.argmax(end_date_range.values > tra, axis=1)

df['window_start_dt'] = start_date_range[idx1]
df['window_end_dt'] = end_date_range[idx2]

However, this is still not working correctly because it only sets 'window_start_dt' to the lowest/first value in the date range: '2004-01-01'. Good news is it should be faster yet again.

EDIT:

I added an answer with the solution to the date collision issue below based on jezrael's answer

EDIT

Turns out there was still one more edge case. Please see jezrael's answer here for the final solution:Numpy: conditional np.where replace

df['window_start_dt'] = df['transaction_dt'].apply(find_window_start_date) and then from further on the trace window_start_date_idx = np.argmax(x < start_date_period.end_time) — Pylander
– Pylander, Commented Nov 30, 2017 at 19:01
Catch the error and inspect the data in the except suite --> maybe print x and start_date_period.end_time to see what they are then try the comparison, with those values, in the shell. — wwii
– wwii, Commented Nov 30, 2017 at 20:30

Pylander · Accepted Answer · 2017-12-12 00:49:05Z

2

I actually found a particularly clever fix for the date collision issue which is just a small change from @jezrael's fantastic answer:

It occurred to me that start_date_range and end_date_range have the exact same number of elements and that, in this specific case, you can just reuse the index created from end_date_range and apply it via np.take to start_date_range to get the correct intervals every time:

start_date_range = pd.date_range('2004-01-01 00:00:00', '12-31-2017 00:00:00', freq='30D')
end_date_range = pd.date_range('2004-01-30 23:59:59', '12-31-2017 23:59:59', freq='30D')

tra = df['transaction_dt'].values[:, None]
idx = np.argmax(end_date_range.values > tra, axis=1)

df['window_start_dt'] = np.take(start_date_range, idx)
df['window_end_dt'] = end_date_range[idx]

It is also even faster now!

answered Dec 12, 2017 at 0:49

Pylander

1,6011 gold badge20 silver badges39 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Pylander Over a year ago

it turns out that this was not quite right in the end. @jezrael provided the final necessary tweak so look here: stackoverflow.com/questions/47765551/…

jezrael · Accepted Answer · 2017-12-01 06:49:22Z

1

I believe you can use broadcasting:

start_date_period = pd.period_range('2004-01-01', '12-31-2017', freq='30D')
end_date_period = pd.period_range('2004-01-30', '12-31-2017', freq='30D')

tra = df['transaction_dt'].values[:, None]
idx1 = np.argmax(tra < start_date_period.end_time.values, axis=1)
idx2 = np.argmin(tra > end_date_period.start_time.values, axis=1)

df['window_start_dt'] = start_date_period[idx1]
df['window_end_dt'] = end_date_period[idx2]
print (df)

   customer_id transaction_dt product  price  units window_start_dt  \
0            1     2004-01-02  thing1     25     47      2004-01-01   
1            1     2004-01-17  thing2    150      8      2004-01-01   
2            2     2004-01-29  thing2    150     25      2004-01-01   
3            3     2017-07-15  thing3     55     17      2017-06-21   
4            3     2016-05-12  thing3     55     47      2016-04-27   
5            4     2012-02-23  thing2    150     22      2012-02-18   
6            4     2009-10-10  thing1     25     12      2009-10-01   
7            4     2014-04-04  thing2    150      2      2014-03-09   
8            5     2008-07-09  thing2    150     43      2008-07-08   

  window_end_dt  
0    2004-01-30  
1    2004-01-30  
2    2004-01-30  
3    2017-07-20  
4    2016-05-26  
5    2012-03-18  
6    2009-10-30  
7    2014-04-07  
8    2008-08-06

answered Dec 1, 2017 at 6:49

jezrael

868k103 gold badges1.4k silver badges1.3k bronze badges

6 Comments

Pylander Over a year ago

Wow, this is an amazing optimization! I get a 110x speed up compared to the original, row-wise apply. I had forgotten that broadcasting was an option.

Pylander Over a year ago

I have noticed one issue with this approach, unfortunately. I am getting collisions when transaction_dt occurs on the last day of a window like '2004-01-30' and the first day like '2004-0-31'. Do you see a solution to that? Thanks again!

jezrael Over a year ago

I check it and I have question, what is desired output? Is possible some dates outside ranges start_date_period and end_date_period ? Because the best is filter problematic values and set them manually. And there is 3 possibilities - NaT, some scalar dates (min or max in start_date_period or dates are not changed - only copied). Thanks for clarification.

Pylander Over a year ago

Thanks! Sure for example if I have a transaction_dt of '2004-01-30' it is currently being put into the wrong window: window_start_dt:'2004-01-31' and window_end_dt:'2004-01-30'. This should be in the window_start_dt:'2004-01-01' and window_end_dt:'2004-01-30'.

Pylander Over a year ago

I have updated the question with an example data change and a new approach I have been trying that seems to be close now.

|

wwii · Accepted Answer · 2017-11-30 20:36:08Z

I'll sidestep the error you are getting for mismatching lengths.

This still uses .apply so it may not be what you want, but maybe simplifying it will help:

Make a column of pandas.Period's with a month frequency.

df['window'] = df.transaction_dt.apply(pd.Period, freq='M')

> df
=>            transaction_dt product  price  units  window
customer_id                                             
1               2004-01-02  thing1     25     47 2004-01
1               2004-01-17  thing2    150      8 2004-01
2               2004-01-29  thing2    150     25 2004-01
3               2017-07-15  thing3     55     17 2017-07
3               2016-05-12  thing3     55     47 2016-05
4               2012-02-23  thing2    150     22 2012-02
4               2009-10-10  thing1     25     12 2009-10
4               2014-04-04  thing2    150      2 2014-04
5               2008-07-09  thing2    150     43 2008-07

Each Period has a start_time and end_time attribute which is easy to use

>   df.window.dt.start_time
=> customer_id
1   2004-01-01
1   2004-01-01
2   2004-01-01
3   2017-07-01
3   2016-05-01
4   2012-02-01
4   2009-10-01
4   2014-04-01
5   2008-07-01
Name: window, dtype: datetime64[ns]

If you really need a start and end column;

df['start'] = df.window.dt.start_time
df['end'] = df.window.dt.end_time

>   df.head()
=>             transaction_dt product  price  units  window      start        end
customer_id                                                                   
1               2004-01-02  thing1     25     47 2004-01 2004-01-01 2004-01-31
1               2004-01-17  thing2    150      8 2004-01 2004-01-01 2004-01-31
2               2004-01-29  thing2    150     25 2004-01 2004-01-01 2004-01-31
3               2017-07-15  thing3     55     17 2017-07 2017-07-01 2017-07-31
3               2016-05-12  thing3     55     47 2016-05 2016-05-01 2016-05-31

Unfortunately, I'd like to be able to utilize arbitrary date ranges I can define instead on calendar dates. This is much faster though.

Collectives™ on Stack Overflow

python pandas: vectorized function value error "lengths do not match"

3 Answers 3

1 Comment

6 Comments

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

1 Comment

6 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related