I have the following code:

from datetime import datetime
import numpy as np
import pandas as pd

datetime_const = datetime(2021, 3, 31)
tmp_df1['datetime2'] = pd.to_datetime(tmp_df1['datetime1'], format='%Y-%m-%d')
tmp_df1['test_col_1'] = (tmp_df1['value1'] < 0.0002) & (tmp_df1['datetime2'] < (datetime_const + pd.DateOffset(months=12)))
tmp_df1['test_col_2'] = (tmp_df1['value1'] >= 0.0002) & ((((tmp_df1['datetime2'] - datetime_const ).dt.days/365)*tmp_df1['value1']) < 0.0002)
tmp_df1['test_col_3'] = datetime_const + pd.DateOffset(months=12)
tmp_df1['test_col_4'] = datetime_const + pd.to_timedelta(((0.0002/tmp_df1['value1'])*365).round(), unit='D')
tmp_df1['test_col_5'] = tmp_df1['datetime2']
tmp_df1['datetime3'] = np.select(
    [
        (tmp_df1['value1'] < 0.0002) & (tmp_df1['datetime2'] < (datetime_const + pd.DateOffset(months=12))),
        (tmp_df1['value1'] >= 0.0002) & ((((tmp_df1['datetime2'] - datetime_const ).dt.days/365)*tmp_df1['value1']) < 0.0002)
    ],
    [
        datetime_const + pd.DateOffset(months=12),
        datetime_const + pd.to_timedelta(((0.0002/tmp_df1['value1'])*365).round(), unit='D')
    ],
    default=tmp_df1['datetime2']
)

datetime1 has object dtype, so I converted it to datetime64 and assigned the result to datetime2.

value1 is a float column with a bunch of decimal numbers; it does contain NaNs.

I created test_col_1 to test_col_5 to check the individual conditions and choices used in my np.select call. They all look correct when assigned as individual DataFrame columns.

However, the datetime3 column assigned from np.select has object dtype and contains strange large numbers, like 1628121600000000000. I would expect it to contain datetime64 values, taken either from one of the two choices or from the default datetime2 column.

Please see the sample .info() output and DataFrame rows below:

Data columns (total 8 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   datetime2                   26558 non-null  datetime64[ns]
 1   value1                      25438 non-null  float64       
 2   test_col_1                  26558 non-null  bool          
 3   test_col_2                  26558 non-null  bool          
 4   test_col_3                  26558 non-null  datetime64[ns]
 5   test_col_4                  25438 non-null  datetime64[ns]
 6   test_col_5                  26558 non-null  datetime64[ns]
 7   datetime3                   26558 non-null  object        
dtypes: bool(2), datetime64[ns](4), float64(1), object(1)
memory usage: 1.5+ MB

            datetime2   value1  test_col_1  test_col_2 test_col_3 test_col_4 test_col_5        datetime3
0           2021-06-30 0.00058       False        True 2022-03-31 2021-08-05 2021-06-30        1628121600000000000
1           2022-03-31 0.00044       False       False 2022-03-31 2021-09-13 2022-03-31        1648684800000000000
2           2024-06-07 0.00860       False       False 2022-03-31 2021-04-08 2024-06-07        1717718400000000000
3           2021-09-30 0.00867       False       False 2022-03-31 2021-04-08 2021-09-30        1632960000000000000
4           2021-08-31 0.00144       False       False 2022-03-31 2021-05-21 2021-08-31        1630368000000000000
5           2021-08-31 0.00144       False       False 2022-03-31 2021-05-21 2021-08-31        1630368000000000000
6           2021-04-08 0.00474       False        True 2022-03-31 2021-04-15 2021-04-08        1618444800000000000
7           2023-10-01 0.11506       False       False 2022-03-31 2021-04-01 2023-10-01        1696118400000000000
8           2023-09-29 0.12067       False       False 2022-03-31 2021-04-01 2023-09-29        1695945600000000000
9           2021-05-31 0.02508       False       False 2022-03-31 2021-04-03 2021-05-31        1622419200000000000

I am completely baffled by this behavior, please enlighten me!

Thank you all in advance!

    @Ben.T Good point, I added some examples of what I'm seeing. Thank you. Commented Sep 1, 2021 at 21:13

1 Answer


It looks like np.select converts the dates to their int64 representation (nanoseconds since the epoch). An easy fix is to convert the result back afterwards with astype:

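# dummy data
from datetime import datetime
import numpy as np
import pandas as pd

datetime_const = datetime(2021, 3, 31)  # same constant as in the question
tmp_df1 = pd.DataFrame([['2021-06-30', 0.00058],['2023-10-01', 0.11506 ]],
                       columns= ['datetime2','value1'])
tmp_df1['datetime2'] = pd.to_datetime(tmp_df1['datetime2'], format='%Y-%m-%d')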


tmp_df1['datetime3'] = np.select(
    [
        (tmp_df1['value1'] < 0.0002) & (tmp_df1['datetime2'] < (datetime_const + pd.DateOffset(months=12))),
        (tmp_df1['value1'] >= 0.0002) & ((((tmp_df1['datetime2'] - datetime_const ).dt.days/365)*tmp_df1['value1']) < 0.0002)
    ],
    [
        datetime_const + pd.DateOffset(months=12),
        datetime_const + pd.to_timedelta(((0.0002/tmp_df1['value1'])*365).round(), unit='D')
    ],
    default=tmp_df1['datetime2']
).astype('datetime64[ns]') ### <--- add this

print(tmp_df1)
   datetime2   value1  datetime3
0 2021-06-30  0.00058 2021-08-04
1 2023-10-01  0.11506 2023-10-01
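
To see where the large integers come from, here is a minimal NumPy-only sketch. It assumes that np.select settles on one common dtype for all choices, roughly the way np.result_type does; the variable names are made up for the illustration. A plain datetime scalar becomes an object array, the common dtype with datetime64[ns] is then object, and casting datetime64[ns] to object yields integers counting nanoseconds since the epoch.

```python
from datetime import datetime

import numpy as np

# A datetime64[ns] array, like the values behind a pandas datetime column.
dates = np.array(['2021-08-05'], dtype='datetime64[ns]')

# A plain datetime scalar is not converted to datetime64 automatically;
# NumPy wraps it in a 0-d array with object dtype.
scalar = np.asarray(datetime(2022, 3, 31))

# Mixing object with datetime64[ns] gives object as the common dtype.
common = np.result_type(scalar, dates)

# Casting datetime64[ns] to object produces Python ints (nanoseconds since epoch).
as_object = dates.astype(common)

# Casting back recovers the original dates, which is what the astype fix relies on.
round_trip = as_object.astype('datetime64[ns]')

print(scalar.dtype, common, as_object, round_trip)
```

This is why appending .astype('datetime64[ns]') to the np.select result restores proper dates.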

Longer explanation

I think the problem is in your two choices: the first one is a single value and the second is a Series. You can see that it works when the first choice is a Series too (with datetime dtype):

# dummy
tmp_df1 = pd.DataFrame([['2021-06-30', 0.00058],['2023-10-01', 0.11506 ]],
                       columns= ['datetime2','value1'])
tmp_df1['datetime2'] = pd.to_datetime(tmp_df1['datetime2'], format='%Y-%m-%d')

If I use your method, I get the long integer representation (like you):

np.select(
    ...
    [
        datetime_const + pd.DateOffset(months=12),
        datetime_const + pd.to_timedelta(((0.0002/tmp_df1['value1'])*365).round(), unit='D')
    ],...
)
# gives
array([1628035200000000000, 1696118400000000000], dtype=object)

But if I replace datetime_const in the first choice with a Series (not related to your use case):

np.select(
    ...
    [
        tmp_df1['datetime2'] + pd.DateOffset(months=12), # here replace the constant by the column datetime2 for example
        datetime_const + pd.to_timedelta(((0.0002/tmp_df1['value1'])*365).round(), unit='D')
    ],
    ...
)
# gives the proper datetime dtype (the first choice is no longer the right value, of course)
array(['2021-08-04T00:00:00.000000000', '2023-10-01T00:00:00.000000000'],
      dtype='datetime64[ns]')
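
If you would rather avoid the conversion altogether, one possible variation (a sketch, not tested against your full data) is to broadcast the constant first choice into a Series, so that every choice has a datetime dtype while the value stays correct. The names cutoff, scaled, conditions, choices and result are introduced here only for the illustration.

```python
from datetime import datetime

import numpy as np
import pandas as pd

datetime_const = datetime(2021, 3, 31)

# Same dummy data as above.
tmp_df1 = pd.DataFrame({
    'datetime2': pd.to_datetime(['2021-06-30', '2023-10-01']),
    'value1': [0.00058, 0.11506],
})

# Scalar Timestamp: passed directly to np.select, it would force object dtype.
cutoff = datetime_const + pd.DateOffset(months=12)
# Series of datetimes computed per row.
scaled = datetime_const + pd.to_timedelta(
    ((0.0002 / tmp_df1['value1']) * 365).round(), unit='D'
)

conditions = [
    (tmp_df1['value1'] < 0.0002) & (tmp_df1['datetime2'] < cutoff),
    (tmp_df1['value1'] >= 0.0002)
    & (((tmp_df1['datetime2'] - datetime_const).dt.days / 365) * tmp_df1['value1'] < 0.0002),
]
choices = [
    pd.Series(cutoff, index=tmp_df1.index),  # broadcast the scalar to a datetime Series
    scaled,
]

result = np.select(conditions, choices, default=tmp_df1['datetime2'])
tmp_df1['datetime3'] = result
print(tmp_df1)
```

With all choices and the default sharing a datetime dtype, np.select should return a datetime64 array directly, so no astype is needed afterwards.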

2 Comments

I know the comment shouldn't be used for this, but I still think it needs to be expressed. Thank you for taking the time to look into this, this solved my problem nicely. Good to know that different object type choices in the np.select can cause behaviors like this.
@rluo glad it helps :) and yes, comments are not technically for this but it is still nice to get thanks from people :) and you can still delete your comment later ;)
