
I'm using Pandas' read_sql() function to read multiple SQL tables into DataFrames. This function only parses columns as datetimes if they are named up front in its parse_dates parameter; it does not infer datetimes automatically from varchar columns on the server. Because of this, I get DataFrames in which all columns are of dtype object.

   col1    col2
   -----------------------------------
0  A       2017-02-04 10:41:00.0000000
1  B       2017-02-04 10:41:00.0000000
2  C       2017-02-04 10:41:00.0000000
3  D       2017-02-04 10:41:00.0000000
4  E       2017-02-03 06:13:00.0000000
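
For context, a minimal sketch of how I read the tables today; the connection string and table name below are placeholders:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mssql+pyodbc://user:pass@my_dsn")  # placeholder DSN

# Works, but requires knowing the datetime column names up front
df = pd.read_sql("my_table", engine, parse_dates=["col2"])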

Is there a built-in Pandas function to automatically infer columns which should be datetime64[ns] WITHOUT having to specify the column names?

I've tried:

df.apply(pd.to_datetime(df, infer_datetime_format=True), axis=1)

which results in an error, because passing an entire DataFrame to pd.to_datetime makes it try to assemble datetimes from year/month/day columns:

to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing

I also tried:

pd.to_datetime(df.stack(),  errors='ignore', format='%Y%m%d% H%M%S%f').unstack()

and

pd.to_datetime(df.stack(),  errors='coerce', format='%Y%m%d% H%M%S%f').unstack()

But neither of these works.

Any suggestions about how to infer datetime columns automatically after the DataFrame is constructed?

  • In my opinion it does not exist. You always need to specify the column(s) to parse as datetime. Commented Aug 12, 2017 at 17:57
  • I don't think you can. Numbers can be converted to datetime as well, so your integers and floats are also at risk. If you have a specific format you can check for it, but other than that you don't have many options. Commented Aug 12, 2017 at 17:59
  • It's not the best idea to store dates as varchars in the database. You can't use date arithmetic or any of the functions that operate on dates/timestamps, and some RDBMSs have special optimizations for date/timestamp indices, primary keys, etc. Another drawback is your Pandas problem: if you had a date data type on the DB side, Pandas would read it properly as datetime dtype. That said, I would consider fixing the data types on the DB side (see the sketch after these comments)... Commented Aug 12, 2017 at 20:05
  • Thanks for the comments. I ended up writing a custom function to parse them, akin to the suggestions below. Commented Aug 15, 2017 at 18:46
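
Following up on the DB-side suggestion above, a minimal sketch of casting the varchar to a real datetime inside the query, so that read_sql infers the dtype by itself; the query, column, and connection are hypothetical:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mssql+pyodbc://user:pass@my_dsn")  # hypothetical DSN

# Cast on the server so the driver returns a real datetime,
# which read_sql maps straight to datetime64[ns]
query = "SELECT col1, CAST(col2 AS DATETIME2) AS col2 FROM my_table"
df = pd.read_sql(query, engine)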

4 Answers


There is currently no built-in way to convert object columns to datetime automatically. One simple approach is a list comprehension plus a regex pattern that matches the datetime varchar, i.e.

If you have a df (based on @Alexander's df)

df = pd.DataFrame({'col1': ['A', 'B', 'C', 'D', 'E'],
                   'col2': ['2017-02-04 18:41:00',
                            '2017-02-04 18:41:00',
                            '2017-02-04 18:41:00',
                            '2017-02-04 18:41:00',
                            '2017-02-03 14:13:00'],
                   'col3': [0, 1, 2, 3, 4],
                   'col4': ['2017-02-04 18:41:00',
                            '2017-02-04 18:41:00',
                            '2017-02-04 18:41:00',
                            '2017-02-04 18:41:00',
                            '2017-02-03 14:13:00']})

pattern = r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}'
data = [pd.to_datetime(df[x]) if df[x].astype(str).str.match(pattern).all() else df[x]
        for x in df.columns]

df = pd.concat(data, axis=1, keys=[s.name for s in data])

or with the help of a mask, i.e.

mask = df.astype(str).apply(lambda x: x.str.match(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}').all())
date_cols = df.columns[mask]
df[date_cols] = df[date_cols].apply(pd.to_datetime)  # plain [] assignment lets the dtype change

df.dtypes

Output:

col1            object
col2    datetime64[ns]
col3             int64
col4    datetime64[ns]
dtype: object

If you have mixed date formats, then you can use r'(\d{2,4}-\d{2}-\d{2,4})+'. E.g.:

ndf = pd.DataFrame({'col3': [0, 1, 2, 3, 4],
                    'col4': ['2017-02-04 18:41:00',
                             '2017-02-04 18:41:00',
                             '2017-02-04 18:41:00',
                             '2017-02-04 18:41:00',
                             '2017-02-03 14:13:00'],
                    'col5': ['2017-02-04',
                             '2017-02-04',
                             '17-02-2004 14:13:00',
                             '17-02-2014',
                             '2017-02-03']})

mask = ndf.astype(str).apply(lambda x: x.str.match(r'(\d{2,4}-\d{2}-\d{2,4})+').all())
date_cols = ndf.columns[mask]
ndf[date_cols] = ndf[date_cols].apply(pd.to_datetime, format='mixed')  # format='mixed' needs pandas >= 2.0

Output:

   col3                col4                col5
0     0 2017-02-04 18:41:00 2017-02-04 00:00:00
1     1 2017-02-04 18:41:00 2017-02-04 00:00:00
2     2 2017-02-04 18:41:00 2004-02-17 14:13:00
3     3 2017-02-04 18:41:00 2014-02-17 00:00:00
4     4 2017-02-03 14:13:00 2017-02-03 00:00:00

Hope it helps


2 Comments

  • Glad to help, sir.
  • I used this solution but I get this error: Must have equal len keys and value when setting with an iterable

You can identify which columns in your dataframe are of type object and then convert only those columns to datetime, using errors='coerce' so that values which cannot be converted become NaT instead of raising. Then use combine_first to fill the NaT values back in from the original dataframe, so you keep the timestamps that did convert.

df = pd.DataFrame(
     {'col1': ['A', 'B', 'C', 'D', 'E'],
      'col2': ['2017-02-04 18:41:00',
               '2017-02-04 18:41:00',
               '2017-02-04 18:41:00',
               '2017-02-04 18:41:00',
               '2017-02-03 14:13:00'],
      'col3': [0, 1, 2, 3, 4]})

object_cols = [col for col, col_type in df.dtypes.items() if col_type == 'object']

# Converted timestamps take precedence; NaT (failed parses) falls back to the originals
df[object_cols] = df[object_cols].apply(
    pd.to_datetime, errors='coerce').combine_first(df[object_cols])
>>> df
  col1                col2  col3
0    A 2017-02-04 18:41:00     0
1    B 2017-02-04 18:41:00     1
2    C 2017-02-04 18:41:00     2
3    D 2017-02-04 18:41:00     3
4    E 2017-02-03 14:13:00     4

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 3 columns):
col1    5 non-null object
col2    5 non-null datetime64[ns]
col3    5 non-null int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 160.0+ bytes
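
There is still no true built-in for this, but building on the same idea, here is a hedged sketch of a reusable helper; the select_dtypes call and the "keep the conversion only if most values parse" threshold are my additions, not part of this answer:

import pandas as pd

def infer_datetime_cols(df, min_parsed=0.9):
    """Convert object columns to datetime64[ns] when most of their values parse."""
    out = df.copy()
    for col in out.select_dtypes(include='object').columns:
        parsed = pd.to_datetime(out[col], errors='coerce')
        # Adopt the conversion only if at least min_parsed of the values parsed,
        # which guards against int-like strings, names, etc.
        if parsed.notna().mean() >= min_parsed:
            out[col] = parsed
    return out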



Use a regex that covers several date formats. The pattern below matches dates whose parts are separated by '-', '/', '\', '.', or a space, e.g. yyyy-mm-dd, yyyy/mm/dd, yyyy\mm\dd, or yyyy.mm.dd.

The code below finds the datetime columns and converts them:

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype as is_datetime

def format_date(df):
    non_numeric_cols = [c for c, t in df.dtypes.items() if t == 'object' or is_datetime(df[c])]
    if non_numeric_cols:
        # A column qualifies if any of its values looks like a date
        mask = df.astype(str).apply(lambda x: x.str.match(
            r'(\d{2,4}(-|\/|\\|\.| )\d{2}(-|\/|\\|\.| )\d{2,4})+').any())
        df[df.columns[mask]] = df[df.columns[mask]].apply(pd.to_datetime, dayfirst=False)
    return df
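
A quick usage sketch on a toy frame (the sample data here is my own, not from the answer):

sample = pd.DataFrame({'a': ['2017-02-04', '2017-02-05'], 'b': [1, 2]})
print(format_date(sample).dtypes)
# a    datetime64[ns]
# b             int64
# dtype: object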


If none of these work out, I have a workaround that sort of works, although it rests on certain assumptions. Here I assume that any datetime columns already present in the dataset have 'year', 'month', 'date', or 'day' in their name, and I use a regex to pick them out.

import re

# Normalize column names to lowercase before matching
normzd_column_names = [x.lower() for x in df.columns]

pattern = 'year|month|date|day'
for col in normzd_column_names:
    if re.search(pattern, col):
        print(f'Datetime column found: {col}')
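
The snippet above only reports the matches; here is a small extension (my addition, not part of the original answer) that actually converts the matched columns, coercing unparseable values to NaT:

for col in df.columns:
    if re.search(pattern, col.lower()):
        # Assumes the matched columns hold parseable date strings
        df[col] = pd.to_datetime(df[col], errors='coerce')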
