2

I'm trying to create a Dask DataFrame from a NumPy array, and I need to specify the column types. As suggested in the Dask documentation, I pass an empty pandas DataFrame as meta. This doesn't throw an error, but every column ends up with dtype object. I need to use the empty pandas DataFrame; how can I make this work?

from datetime import datetime

import numpy as np
import pandas as pd
import dask.dataframe as dd

array = np.array([(1.5, 2, 3, datetime(2000,1,1)), (4, 5, 6, datetime(2001, 2, 2))])
meta = pd.DataFrame({'col1': pd.Series(dtype='float64'),
                   'col2': pd.Series(dtype='float64'),
                   'col3': pd.Series(dtype='float64'),
                   'date1': pd.Series(dtype='datetime64[ns]')})
print(meta.dtypes)

>>> col1            float64
>>> col2            float64
>>> col3            float64
>>> date1    datetime64[ns]
>>> dtype: object

columns = ['col1', 'col2', 'col3', 'date1']
ddf = dd.from_array(array, columns=columns, meta=meta)
ddf.compute()

print(ddf.dtypes)

>>> col1     object
>>> col2     object
>>> col3     object
>>> date1    object
>>> dtype: object
4 Comments
  • How is this different from yesterday's question? stackoverflow.com/questions/70836962/… Commented Jan 25, 2022 at 12:52
  • It’s using an empty pandas frame Commented Jan 25, 2022 at 12:54
  • It looks like this is a bug-- I would encourage you to submit an issue. Including @Alexandra Dudkina's additional solution could be helpful for debugging. Commented Jan 25, 2022 at 22:02
  • Update to my above comment-- this is not a bug, but more of a nuance around how the meta argument works. There is some discussion here on how to improve this. Commented Jan 27, 2022 at 17:39
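
Separately from how meta is handled, it may help to check the dtype of the NumPy array itself. The sketch below uses only NumPy and the array from the question; it does not call Dask, and the variable name `numeric` is introduced here purely for illustration. Because each row mixes numbers with datetime objects, NumPy stores the whole array with dtype object, so the data start out untyped before any DataFrame is built.

```python
from datetime import datetime

import numpy as np

# Same array as in the question: numbers and datetime objects in each row.
array = np.array([(1.5, 2, 3, datetime(2000, 1, 1)),
                  (4, 5, 6, datetime(2001, 2, 2))])

# The mixed element types leave NumPy with no common dtype other than object.
print(array.dtype)
print(array.shape)

# The three numeric columns can still be cast explicitly once selected.
numeric = array[:, :3].astype("float64")
print(numeric.dtype)
```

This suggests that some explicit cast is needed somewhere, whether on the array, on a pandas DataFrame, or on the Dask DataFrame after creation.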

2 Answers

2

Could the dtypes be set after the DataFrame is created?

import pandas as pd
import numpy as np
from datetime import datetime
import dask.dataframe as dd

array = np.array([(1.5, 2, 3, datetime(2000,1,1)), (4, 5, 6, datetime(2001, 2, 2))])

columns = ['col1', 'col2', 'col3', 'date1']
ddf = dd.from_array(array, columns = columns)
ddf.compute()

ddf = ddf.astype({'col1': 'float64','col2':'float64','col3':'float64','date1':'datetime64[ns]'})
print(ddf.dtypes)

0

Does this work?

df = (pd.DataFrame(array, columns = ["col1", "col2", "col3", "col4"])
      .astype({"col1": "float64", 
               "col2": "float64", 
               "col3": "float64", 
               "col4": "datetime64[ns]"}))
ddf = dd.from_pandas(df, npartitions=10)

The output of ddf.dtypes gives me the correct data types.

1 Comment

Thanks, but I need to create the data frame with dd.from_array
