
I'm using Pandas' read_sql() function to read multiple SQL tables into DataFrames. This function only parses columns as datetimes if they are named up front in its parse_dates parameter; it does not infer datetimes automatically from varchar columns on the server. Because of this, I get DataFrames in which all columns are of dtype object.

   col1    col2
   -----------------------------------
0  A       2017-02-04 10:41:00.0000000
1  B       2017-02-04 10:41:00.0000000
2  C       2017-02-04 10:41:00.0000000
3  D       2017-02-04 10:41:00.0000000
4  E       2017-02-03 06:13:00.0000000
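
For context, a minimal sketch of how I read the tables today; the connection string and table name below are placeholders:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mssql+pyodbc://user:pass@my_dsn")  # placeholder DSN

# Works, but requires knowing the datetime column names up front
df = pd.read_sql("my_table", engine, parse_dates=["col2"])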

Is there a built-in Pandas function to automatically infer columns which should be datetime64[ns] WITHOUT having to specify the column names?

I've tried:

df.apply(pd.to_datetime(df, infer_datetime_format=True), axis=1)

which results in an error, because passing an entire DataFrame to pd.to_datetime makes it try to assemble datetimes from year/month/day columns:

to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing

I also tried:

pd.to_datetime(df.stack(),  errors='ignore', format='%Y%m%d% H%M%S%f').unstack()

and

pd.to_datetime(df.stack(),  errors='coerce', format='%Y%m%d% H%M%S%f').unstack()

But neither of these works.

Any suggestions about how to infer datetime columns automatically after the DataFrame is constructed?

  • In my opinion it does not exist. You always need to specify the column(s) to parse as datetime. Commented Aug 12, 2017 at 17:57
  • I don't think you can. Numbers can be converted to datetime as well, so your integers and floats are also at risk. If you have a specific format you can check for it, but other than that you don't have many options. Commented Aug 12, 2017 at 17:59
  • It's not the best idea to store dates as varchars in the database. You can't use date arithmetic or any of the functions that operate on dates/timestamps, and some RDBMSs have special optimizations for date/timestamp indices, primary keys, etc. Another drawback is your Pandas problem: if you had a date data type on the DB side, Pandas would read it properly as datetime dtype. That said, I would consider fixing the data types on the DB side (see the sketch after these comments)... Commented Aug 12, 2017 at 20:05
  • Thanks for the comments. I ended up writing a custom function to parse them, akin to the suggestions below. Commented Aug 15, 2017 at 18:46
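
Following up on the DB-side suggestion above, a minimal sketch of casting the varchar to a real datetime inside the query, so that read_sql infers the dtype by itself; the query, column, and connection are hypothetical:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mssql+pyodbc://user:pass@my_dsn")  # hypothetical DSN

# Cast on the server so the driver returns a real datetime,
# which read_sql maps straight to datetime64[ns]
query = "SELECT col1, CAST(col2 AS DATETIME2) AS col2 FROM my_table"
df = pd.read_sql(query, engine)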

4 Answers


There is currently no built-in way to convert object columns to datetime automatically. One simple approach is a list comprehension plus a regex pattern that matches the datetime varchar, i.e.

If you have a df (based on @Alexander's df)

df = pd.DataFrame({'col1': ['A', 'B', 'C', 'D', 'E'],
                   'col2': ['2017-02-04 18:41:00',
                            '2017-02-04 18:41:00',
                            '2017-02-04 18:41:00',
                            '2017-02-04 18:41:00',
                            '2017-02-03 14:13:00'],
                   'col3': [0, 1, 2, 3, 4],
                   'col4': ['2017-02-04 18:41:00',
                            '2017-02-04 18:41:00',
                            '2017-02-04 18:41:00',
                            '2017-02-04 18:41:00',
                            '2017-02-03 14:13:00']})

pattern = r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}'
data = [pd.to_datetime(df[x]) if df[x].astype(str).str.match(pattern).all() else df[x]
        for x in df.columns]

df = pd.concat(data, axis=1, keys=[s.name for s in data])

or with the help of a mask, i.e.

mask = df.astype(str).apply(lambda x: x.str.match(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}').all())
date_cols = df.columns[mask]
df[date_cols] = df[date_cols].apply(pd.to_datetime)  # plain [] assignment lets the dtype change

df.dtypes

Output:

col1            object
col2    datetime64[ns]
col3             int64
col4    datetime64[ns]
dtype: object

If you have mixed date formats, then you can use r'(\d{2,4}-\d{2}-\d{2,4})+'. E.g.:

ndf = pd.DataFrame({'col3': [0, 1, 2, 3, 4],
                    'col4': ['2017-02-04 18:41:00',
                             '2017-02-04 18:41:00',
                             '2017-02-04 18:41:00',
                             '2017-02-04 18:41:00',
                             '2017-02-03 14:13:00'],
                    'col5': ['2017-02-04',
                             '2017-02-04',
                             '17-02-2004 14:13:00',
                             '17-02-2014',
                             '2017-02-03']})

mask = ndf.astype(str).apply(lambda x: x.str.match(r'(\d{2,4}-\d{2}-\d{2,4})+').all())
date_cols = ndf.columns[mask]
ndf[date_cols] = ndf[date_cols].apply(pd.to_datetime, format='mixed')  # format='mixed' needs pandas >= 2.0

Output:

   col3                col4                col5
0     0 2017-02-04 18:41:00 2017-02-04 00:00:00
1     1 2017-02-04 18:41:00 2017-02-04 00:00:00
2     2 2017-02-04 18:41:00 2004-02-17 14:13:00
3     3 2017-02-04 18:41:00 2014-02-17 00:00:00
4     4 2017-02-03 14:13:00 2017-02-03 00:00:00

Hope it helps


2 Comments

  • Glad to help, sir.
  • I used this solution but I get this error: Must have equal len keys and value when setting with an iterable

You can identify which columns in your dataframe are of type object and then convert only those columns to datetime, using errors='coerce' so that values which cannot be converted become NaT instead of raising. Then use combine_first to fill the NaT values back in from the original dataframe, so you keep the timestamps that did convert.

df = pd.DataFrame(
     {'col1': ['A', 'B', 'C', 'D', 'E'],
      'col2': ['2017-02-04 18:41:00',
               '2017-02-04 18:41:00',
               '2017-02-04 18:41:00',
               '2017-02-04 18:41:00',
               '2017-02-03 14:13:00'],
      'col3': [0, 1, 2, 3, 4]})

object_cols = [col for col, col_type in df.dtypes.items() if col_type == 'object']

# Converted timestamps take precedence; NaT (failed parses) falls back to the originals
df[object_cols] = df[object_cols].apply(
    pd.to_datetime, errors='coerce').combine_first(df[object_cols])
>>> df
  col1                col2  col3
0    A 2017-02-04 18:41:00     0
1    B 2017-02-04 18:41:00     1
2    C 2017-02-04 18:41:00     2
3    D 2017-02-04 18:41:00     3
4    E 2017-02-03 14:13:00     4

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 3 columns):
col1    5 non-null object
col2    5 non-null datetime64[ns]
col3    5 non-null int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 160.0+ bytes
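
There is still no true built-in for this, but building on the same idea, here is a hedged sketch of a reusable helper; the select_dtypes call and the "keep the conversion only if most values parse" threshold are my additions, not part of this answer:

import pandas as pd

def infer_datetime_cols(df, min_parsed=0.9):
    """Convert object columns to datetime64[ns] when most of their values parse."""
    out = df.copy()
    for col in out.select_dtypes(include='object').columns:
        parsed = pd.to_datetime(out[col], errors='coerce')
        # Adopt the conversion only if at least min_parsed of the values parsed,
        # which guards against int-like strings, names, etc.
        if parsed.notna().mean() >= min_parsed:
            out[col] = parsed
    return out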



Use a regex that covers several date formats. The pattern below matches dates whose parts are separated by '-', '/', '\', '.', or a space, e.g. yyyy-mm-dd, yyyy/mm/dd, yyyy\mm\dd, or yyyy.mm.dd.

The code below finds the datetime columns and converts them:

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype as is_datetime

def format_date(df):
    non_numeric_cols = [c for c, t in df.dtypes.items() if t == 'object' or is_datetime(df[c])]
    if non_numeric_cols:
        # A column qualifies if any of its values looks like a date
        mask = df.astype(str).apply(lambda x: x.str.match(
            r'(\d{2,4}(-|\/|\\|\.| )\d{2}(-|\/|\\|\.| )\d{2,4})+').any())
        df[df.columns[mask]] = df[df.columns[mask]].apply(pd.to_datetime, dayfirst=False)
    return df
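
A quick usage sketch on a toy frame (the sample data here is my own, not from the answer):

sample = pd.DataFrame({'a': ['2017-02-04', '2017-02-05'], 'b': [1, 2]})
print(format_date(sample).dtypes)
# a    datetime64[ns]
# b             int64
# dtype: object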


If none of these work out, I have a workaround that sort of works, although it rests on certain assumptions. Here I assume that any datetime columns already present in the dataset have 'year', 'month', 'date', or 'day' in their name, and I use a regex to pick them out.

import re

# Normalize column names to lowercase before matching
normzd_column_names = [x.lower() for x in df.columns]

pattern = 'year|month|date|day'
for col in normzd_column_names:
    if re.search(pattern, col):
        print(f'Datetime column found: {col}')
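
The snippet above only reports the matches; here is a small extension (my addition, not part of the original answer) that actually converts the matched columns, coercing unparseable values to NaT:

for col in df.columns:
    if re.search(pattern, col.lower()):
        # Assumes the matched columns hold parseable date strings
        df[col] = pd.to_datetime(df[col], errors='coerce')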
