I have a pandas DataFrame with multiple columns of strings representing dates, with empty strings representing missing dates. For example:
import numpy as np
import pandas as pd
# expected date format is '%m/%d/%Y'
custId = np.array(list(range(1,6)))
eventDate = np.array(["06/10/1992","08/24/2012","04/24/2015","","10/14/2009"])
registerDate = np.array(["06/08/2002","08/20/2012","04/20/2015","","10/10/2009"])
# both date columns of dfGood should convert to datetime without error
dfGood = pd.DataFrame({'custId':custId, 'eventDate':eventDate, 'registerDate':registerDate})
I am trying to:
- Efficiently convert columns where all strings are valid dates or empty into columns of type datetime64 (with NaT for the empty)
- Raise ValueError when any non-empty string does not conform to the expected format
Example of where ValueError should be raised:
# 2nd string invalid
registerDate = np.array(["06/08/2002","20/08/2012","04/20/2015","","10/10/2009"])
# eventDate column should convert, registerDate column should raise ValueError
dfBad = pd.DataFrame({'custId':custId, 'eventDate':eventDate, 'registerDate':registerDate})
This function does what I want at the element level:
from datetime import datetime
def parseStrToDt(s, format = '%m/%d/%Y'):
    """Parse a string to datetime with the supplied format."""
    return pd.NaT if s=='' else datetime.strptime(s, format)
print(parseStrToDt("")) # correctly returns NaT
print(parseStrToDt("12/31/2011")) # correctly returns 2011-12-31 00:00:00
print(parseStrToDt("12/31/11")) # correctly raises ValueError
However, I have read that string operations shouldn't be np.vectorize-d. I thought this could be done efficiently using pandas.DataFrame.apply, as in:
dfGood[['eventDate','registerDate']].applymap(lambda s: parseStrToDt(s)) # raises TypeError
dfGood.loc[:,'eventDate'].apply(lambda s: parseStrToDt(s)) # raises same TypeError
I'm guessing that the TypeError has something to do with my function returning a different dtype, but I do want to take advantage of dynamic typing and replace the string with a datetime (unless ValueError is raised). So how can I do this?
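One way the two goals above might be met at the column level is sketched here. It assumes that pd.to_datetime with an explicit format and errors='raise' is an acceptable substitute for the element-wise parseStrToDt, and that each column holds only strings. The helper name to_datetime_strict is hypothetical and not part of the original post.

```python
import pandas as pd

def to_datetime_strict(col, fmt="%m/%d/%Y"):
    """Hypothetical helper: '' becomes NaT, any other non-conforming string raises ValueError."""
    # Mask empty strings to missing values first, so only real strings reach the parser.
    cleaned = col.where(col != "")
    # With errors='raise', a string that does not match fmt raises ValueError.
    return pd.to_datetime(cleaned, format=fmt, errors="raise")

good = pd.Series(["06/08/2002", "08/20/2012", "", "10/10/2009"])
bad = pd.Series(["06/08/2002", "20/08/2012", "", "10/10/2009"])  # 2nd string invalid

print(to_datetime_strict(good))
try:
    to_datetime_strict(bad)
except ValueError as exc:
    print("ValueError:", exc)
```

Since DataFrame.apply passes each column as a Series, something like dfGood[['eventDate','registerDate']].apply(to_datetime_strict) should convert several columns at once, though I have not checked this against every pandas version.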
pd.to_datetime with param errors='coerce', so pd.to_datetime(x, errors='coerce') where x is your df column
pd.to_datetime(dfBad['registerDate'], errors='coerce') does not raise ValueError, and I am looking to raise ValueError on invalid date strings. Setting errors='coerce' prevents that.
np.NaT (Not A Time) for invalid or empty strings and you can filter these out using dropna
np.NaT, and invalid strings, which I do not expect and want to raise ValueError if they are found, as referenced in the question title and shown in the example parseStrToDt
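The distinction drawn in the comments, between empty strings that should become NaT and invalid strings that should raise, can still be recovered after errors='coerce' by comparing the result with the original strings. This is only a sketch: the helper name to_datetime_checked is hypothetical, and it assumes the column contains only strings (an existing NaN would be reported as invalid).

```python
import pandas as pd

def to_datetime_checked(col, fmt="%m/%d/%Y"):
    """Hypothetical helper: coerce first, then raise if a non-empty string failed to parse."""
    # Coerce: every value that cannot be parsed, empty or invalid, becomes NaT.
    converted = pd.to_datetime(col, format=fmt, errors="coerce")
    # A NaT that came from a non-empty string means that string was invalid.
    invalid = converted.isna() & (col != "")
    if invalid.any():
        raise ValueError(f"invalid date strings: {col[invalid].tolist()}")
    return converted

event = pd.Series(["06/10/1992", "08/24/2012", "", "10/14/2009"])
register = pd.Series(["06/08/2002", "20/08/2012", "", "10/10/2009"])  # 2nd string invalid

print(to_datetime_checked(event))
try:
    to_datetime_checked(register)
except ValueError as exc:
    print("ValueError:", exc)
```

Compared with letting pd.to_datetime raise directly, this variant reports all offending strings in one message, at the cost of a second pass over the column.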