I'm trying to create a dask dataframe from a numpy array, and I need to specify the column types. As suggested in the dask documentation, I pass an empty pandas dataframe for that. This doesn't throw an error, but all the columns end up with dtype object. I need to use the empty pandas dataframe; how can I make this work?
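from datetime import datetime
import numpy as np
import pandas as pd
import dask.dataframe as dd
# mixing floats, ints and datetimes gives a 2-D array with dtype=object
array = np.array([(1.5, 2, 3, datetime(2000, 1, 1)), (4, 5, 6, datetime(2001, 2, 2))])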
meta = pd.DataFrame({'col1': pd.Series(dtype='float64'),
'col2': pd.Series(dtype='float64'),
'col3': pd.Series(dtype='float64'),
'date1': pd.Series(dtype='datetime64[ns]')})
print(meta.dtypes)
>>> col1 float64
>>> col2 float64
>>> col3 float64
>>> date1 datetime64[ns]
>>> dtype: object
columns = ['col1', 'col2', 'col3', 'date1']
ddf = dd.from_array(array, columns=columns, meta=meta)
ddf.compute()
print(ddf.dtypes)
>>> col1 object
>>> col2 object
>>> col3 object
>>> date1 object
>>> dtype: object
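The object dtypes come from how the meta argument works: from_array takes the column dtypes from the numpy array itself (object here, because the array mixes numbers and datetimes) and does not cast the data to the dtypes in meta. There is some discussion here on how to improve this.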