
I have the following data frame (PySpark):

 |-- DATE: date (nullable = true)
 |-- ID: string (nullable = true)
 |-- A: double (nullable = true)
 |-- B: double (nullable = true)

While trying to convert the data frame to pandas:

res2 = res.toPandas()

I got the following error: AttributeError: Can only use .dt accessor with datetimelike values

Detailed error:

    AttributeError                            Traceback (most recent call last)
<ipython-input-29-471067d510fa> in <module>
----> 1 res2 = res.toPandas()

/opt/anaconda/lib/python3.7/site-packages/pyspark/sql/dataframe.py in toPandas(self)
   2123                         table = pyarrow.Table.from_batches(batches)
   2124                         pdf = table.to_pandas()
-> 2125                         pdf = _check_dataframe_convert_date(pdf, self.schema)
   2126                         return _check_dataframe_localize_timestamps(pdf, timezone)
   2127                     else:

/opt/anaconda/lib/python3.7/site-packages/pyspark/sql/types.py in _check_dataframe_convert_date(pdf, schema)
   1705     """
   1706     for field in schema:
-> 1707         pdf[field.name] = _check_series_convert_date(pdf[field.name], field.dataType)
   1708     return pdf
   1709 

/opt/anaconda/lib/python3.7/site-packages/pyspark/sql/types.py in _check_series_convert_date(series, data_type)
   1690     """
   1691     if type(data_type) == DateType:
-> 1692         return series.dt.date
   1693     else:
   1694         return series

/opt/anaconda/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5061         if (name in self._internal_names_set or name in self._metadata or
   5062                 name in self._accessors):
-> 5063             return object.__getattribute__(self, name)
   5064         else:
   5065             if self._info_axis._can_hold_identifiers_and_holds_name(name):

/opt/anaconda/lib/python3.7/site-packages/pandas/core/accessor.py in __get__(self, obj, cls)
    169             # we're accessing the attribute of the class, i.e., Dataset.geo
    170             return self._accessor
--> 171         accessor_obj = self._accessor(obj)
    172         # Replace the property with the accessor object. Inspired by:
    173         # http://www.pydanny.com/cached-property.html

/opt/anaconda/lib/python3.7/site-packages/pandas/core/indexes/accessors.py in __new__(cls, data)
    322             pass  # we raise an attribute error anyway
    323 
--> 324         raise AttributeError("Can only use .dt accessor with datetimelike "
    325                              "values")

AttributeError: Can only use .dt accessor with datetimelike values

Any way to fix it? Maybe convert something in the original data frame?

  • Can you print the output of res.show()? Commented Jul 21, 2019 at 7:34
  • @cs95 added the output to the original post. Thanks! Commented Jul 21, 2019 at 7:37
  • Hmm, can't see exactly what the issue could be. Perhaps you could try converting your date column to timestamp, then trying again: from pyspark.sql.functions import to_timestamp; res2 = res.withColumn('DATE', to_timestamp(res.DATE, 'yyyy-MM-dd')).toPandas() Commented Jul 21, 2019 at 7:41
  • @cs95 it worked! thanks! (I could accept that as an answer). Commented Jul 21, 2019 at 7:45

2 Answers


As a workaround, you may consider converting your date column to timestamp (this is more aligned with pandas' datetime type).

from pyspark.sql.functions import to_timestamp
res2 = res.withColumn('DATE', to_timestamp(res.DATE, 'yyyy-MM-dd')).toPandas()
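To confirm the conversion took, a quick check of the resulting pandas dtypes (a sketch; res2 as defined above) should show DATE arriving as datetime64[ns]:

print(res2.dtypes)
# Expected, given the schema above:
# DATE    datetime64[ns]
# ID              object
# A              float64
# B              float64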



I just tested the recommendation by @cs95, and it does work. However, in the output of toPandas I still see the hour:minute:second component, like

|2017-02-15 00:00:00|

This is because pandas takes the date value and fills in a midnight time, even though we don't want one; the 'yyyy-MM-dd' format passed to to_timestamp does not prevent it.
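As an aside, if the midnight times are a problem downstream, one option (a minimal sketch, assuming the DATE column did arrive in pandas as datetime64) is to truncate back to plain dates on the pandas side once toPandas has succeeded:

res2 = res.withColumn('DATE', res.DATE.cast('timestamp')).toPandas()
# .dt.date yields Python date objects (the column dtype becomes object)
res2['DATE'] = res2['DATE'].dt.date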

Back on the Spark side, I get the same result with somewhat simpler code:

col_name = "DATE"
res.withColumn(col_name, col(col_name).cast("timestamp"))

Here's why this is more workable. In the use case I face, there are many (many!) columns in the Spark DataFrame, and I need to find all columns of one type and convert them to another. For example, toPandas complains about Spark decimal columns and recommends conversion, and I don't want to run that one column at a time.

Here's a function. I have defaults set for the decimal case, but this approach works for any types to convert.

from pyspark.sql.functions import col

def spark_type_converter(sdf, x="decimal", y="float"):
    """This uses Spark cast to convert variables of type `x` to `y`.
    :param sdf: Spark data frame
    :param x: character string for the input type to cast FROM.
        This is matched against `dtype` with `startswith`,
        so only the first few characters matter.
    :param y: character string for the name of the type to cast INTO.
    """
    target_cols = [i for i,j in sdf.dtypes if j.startswith(x)]
    for col_name in target_cols:
        sdf = sdf.withColumn(col_name, col(col_name).cast(y)) 
    return sdf

Used here to find all date columns and change them to timestamp:

res = spark_type_converter(res, x="date", y="timestamp")

When toPandas runs, this "one size fits all" approach gives the same result.
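For example, a quick sanity check of the decimal default (a hypothetical toy frame; spark here is an active SparkSession):

from decimal import Decimal

# Hypothetical one-row frame with a decimal column
sdf = spark.createDataFrame([(Decimal("1.50"), "a")], "AMT decimal(10,2), ID string")
sdf = spark_type_converter(sdf)  # defaults: x="decimal", y="float"
print(sdf.dtypes)  # [('AMT', 'float'), ('ID', 'string')]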

