pandas convert string columns to datetime, allowing missing but not invalid

Question

I have a pandas data frame with multiple columns of strings representing dates, with empty strings representing missing dates. For example

import numpy as np
import pandas as pd

# expected date format is 'm/%d/%Y'

custId = np.array(list(range(1,6)))
eventDate = np.array(["06/10/1992","08/24/2012","04/24/2015","","10/14/2009"])
registerDate = np.array(["06/08/2002","08/20/2012","04/20/2015","","10/10/2009"])

# both date columns of dfGood should convert to datetime without error
dfGood = pd.DataFrame({'custId':custId, 'eventDate':eventDate, 'registerDate':registerDate})

I am trying to:

Efficiently convert columns where all strings are valid dates or empty into columns of type datetime64 (with NaT for the empty)
Raise ValueError when any non-empty string does not conform to the expected format,

Example of where ValueError should be raised:

# 2nd string invalid
registerDate = np.array(["06/08/2002","20/08/2012","04/20/2015","","10/10/2009"]) 
# eventDate column should convert, registerDate column should raise ValueError
dfBad = pd.DataFrame({'custId':custId, 'eventDate':eventDate, 'registerDate':registerDate})

This function does what I want at the element level:

from datetime import datetime

def parseStrToDt(s, format = '%m/%d/%Y'):
    """Parse a string to datetime with the supplied format."""
    return pd.NaT if s=='' else datetime.strptime(s, format)

print(parseStrToDt("")) # correctly returns NaT
print(parseStrToDt("12/31/2011")) # correctly returns 2011-12-31 00:00:00
print(parseStrToDt("12/31/11")) # correctly raises ValueError

However, I have read that string operations shouldn't be np.vectorize-d. I thought this could be done efficiently using pandas.DataFrame.apply, as in:

dfGood[['eventDate','registerDate']].applymap(lambda s: parseStrToDt(s)) # raises TypeError

dfGood.loc[:,'eventDate'].apply(lambda s: parseStrToDt(s)) # raises same TypeError

I'm guessing that the TypeError has something to do with my function returning a different dtype, but I do want to take advantage of dynamic typing and replace the string with a datetime (unless ValueError is raise)... so how can I do this?

You can just use pd.to_datetime with param errors='coerce' so pd.to_datetime(x, errors='coerce') where x is your df column — EdChum
– EdChum, Commented Jul 26, 2016 at 15:41
@EdChum thanks but pd.to_datetime(dfBad['registerDate'], errors='coerce') does not raise ValueError, and I am looking to raise ValueError on invalid date strings. Setting errors='coerce' prevents that. — C8H10N4O2
– C8H10N4O2, Commented Jul 26, 2016 at 15:45
But the point here is that you'll get np.NaT (Not A Time) for invalid or empty strings and you can filter these out using dropna — EdChum
– EdChum, Commented Jul 26, 2016 at 15:48
@EdChum there is a difference between empty strings, which I am aware of and want to go to np.NaT, and invalid strings, which I do not expect and want to raise ValueError if they are found, as referenced in the question title and shown in the example parseStrToDt — C8H10N4O2
– C8H10N4O2, Commented Jul 26, 2016 at 15:49

chrisb · Accepted Answer · 2016-07-26 16:50:52Z

2

pandas doesn't have an option that exactly replicates what you want, here's one way to do it, which should be relatively efficient.

In [4]: dfBad
Out[4]: 
   custId   eventDate registerDate
0       1  06/10/1992   06/08/2002
1       2  08/24/2012   20/08/2012
2       3  04/24/2015   04/20/2015
3       4                         
4       5  10/14/2009   10/10/2009

In [7]: cols
Out[7]: ['eventDate', 'registerDate']

In [9]: dts = dfBad[cols].apply(lambda x: pd.to_datetime(x, errors='coerce', format='%m/%d/%Y'))

In [10]: dts
Out[10]: 
   eventDate registerDate
0 1992-06-10   2002-06-08
1 2012-08-24          NaT
2 2015-04-24   2015-04-20
3        NaT          NaT
4 2009-10-14   2009-10-10

In [11]: mask = pd.isnull(dts) & (dfBad[cols] != '')

In [12]: mask
Out[12]: 
  eventDate registerDate
0     False        False
1     False         True
2     False        False
3     False        False
4     False        False


In [13]: mask.any()
Out[13]: 
eventDate       False
registerDate     True
dtype: bool

In [14]: is_bad = mask.any()

In [23]: if is_bad.any():
    ...:     raise ValueError("bad dates in col(s) {0}".format(is_bad[is_bad].index.tolist()))
    ...: else:
    ...:     df[cols] = dts
    ...:     
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-23-579c06ce3c77> in <module>()
      1 if is_bad.any():
----> 2     raise ValueError("bad dates in col(s) {0}".format(is_bad[is_bad].index.tolist()))
      3 else:
      4     df[cols] = dts
      5 

ValueError: bad dates in col(s) ['registerDate']

edited Jul 26, 2016 at 16:50

answered Jul 26, 2016 at 16:44

chrisb

52.7k8 gold badges73 silver badges70 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

C8H10N4O2 Over a year ago

Nice, thanks. There is one more step needed, which is to identify the good columns and convert those, leaving only the bad columns with the exception (which will be handled elsewhere) but I will take that on as a learning exercise.

C8H10N4O2 · Accepted Answer · 2016-07-26 17:37:37Z

Just to take the accepted answer a little further, I replaced the columns of all valid or missing strings with their parsed datetimes, and then raised an error for the remaining unparsed columns:

dtCols = ['eventDate', 'registerDate']
dts = dfBad[dtCols].apply(lambda x: pd.to_datetime(x, errors='coerce', format='%m/%d/%Y'))

mask = pd.isnull(dts) & (dfBad[dtCols] != '')
colHasError = mask.any()

invalidCols = colHasError[colHasError].index.tolist() 
validCols = list(set(dtCols) - set(invalidCols))

dfBad[validCols] = dts[validCols] # replace the completely valid/empty string cols with dates
if colHasError.any():
    raise ValueError("bad dates in col(s) {0}".format(invalidCols))
# raises:  ValueError: bad dates in col(s) ['registerDate']

print(dfBad) # eventDate got converted, registerDate didn't

The accepted answer contains the main insight, though, which is to go ahead and coerce errors to NaT and then distinguish the non-empty but invalid strings from the empty ones with the mask.

Collectives™ on Stack Overflow

pandas convert string columns to datetime, allowing missing but not invalid

2 Answers 2

1 Comment

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related