
I am using PySpark to do most of the data wrangling, but at the end I need to convert to a pandas DataFrame. When converting, columns that I have formatted as dates become "object" dtype in pandas.

Are datetimes incompatible between PySpark and pandas? How can I keep the date format after a PySpark -> pandas DataFrame conversion?

EDIT: converting to timestamp is a workaround, as suggested in another question. How can I find out more about data type compatibility between PySpark and pandas? There is not much info in the documentation.

1 Answer


Check out the Spark documentation; it is more informative than the Databricks documentation you linked in the question.

I think the cleanest solution is to use timestamp rather than date type in your Spark code, as you said.

The other way to do it (which I wouldn't recommend) is to convert from object back to datetime in the pandas DataFrame using pandas' to_datetime function. Something like this:

>>> import pandas as pd
>>> object_series = pd.Series(["2022-01-01", "2022-01-02"])
>>> df = pd.DataFrame({'dates': object_series})
>>> df.dtypes
dates    object
dtype: object
>>> df = df.assign(dates_2=pd.to_datetime(df.dates))
>>> df.dtypes
dates              object
dates_2    datetime64[ns]
dtype: object
>>> df
        dates    dates_2
0  2022-01-01 2022-01-01
1  2022-01-02 2022-01-02
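One caveat with the to_datetime route (my addition, with illustrative data): by default it raises on strings it cannot parse, which you may not want on a large converted frame. Its errors="coerce" option turns unparseable values into NaT instead:

```python
import pandas as pd

# errors="coerce" replaces unparseable strings with NaT
# rather than raising an exception
s = pd.Series(["2022-01-01", "not-a-date"])
parsed = pd.to_datetime(s, errors="coerce")
print(parsed.isna().tolist())  # [False, True]
```

This keeps the column as a proper datetime dtype while flagging the bad rows, which you can then inspect or drop.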

1 Comment

Awesome, thanks. That's what I was looking for, and something to watch out for in the future.
