
I have the following data frame (PySpark):

 |-- DATE: date (nullable = true)
 |-- ID: string (nullable = true)
 |-- A: double (nullable = true)
 |-- B: double (nullable = true)

While trying to convert the data frame to pandas:

res2 = res.toPandas()

I got the following error: AttributeError: Can only use .dt accessor with datetimelike values

Detailed error:

    AttributeError                            Traceback (most recent call last)
<ipython-input-29-471067d510fa> in <module>
----> 1 res2 = res.toPandas()

/opt/anaconda/lib/python3.7/site-packages/pyspark/sql/dataframe.py in toPandas(self)
   2123                         table = pyarrow.Table.from_batches(batches)
   2124                         pdf = table.to_pandas()
-> 2125                         pdf = _check_dataframe_convert_date(pdf, self.schema)
   2126                         return _check_dataframe_localize_timestamps(pdf, timezone)
   2127                     else:

/opt/anaconda/lib/python3.7/site-packages/pyspark/sql/types.py in _check_dataframe_convert_date(pdf, schema)
   1705     """
   1706     for field in schema:
-> 1707         pdf[field.name] = _check_series_convert_date(pdf[field.name], field.dataType)
   1708     return pdf
   1709 

/opt/anaconda/lib/python3.7/site-packages/pyspark/sql/types.py in _check_series_convert_date(series, data_type)
   1690     """
   1691     if type(data_type) == DateType:
-> 1692         return series.dt.date
   1693     else:
   1694         return series

/opt/anaconda/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5061         if (name in self._internal_names_set or name in self._metadata or
   5062                 name in self._accessors):
-> 5063             return object.__getattribute__(self, name)
   5064         else:
   5065             if self._info_axis._can_hold_identifiers_and_holds_name(name):

/opt/anaconda/lib/python3.7/site-packages/pandas/core/accessor.py in __get__(self, obj, cls)
    169             # we're accessing the attribute of the class, i.e., Dataset.geo
    170             return self._accessor
--> 171         accessor_obj = self._accessor(obj)
    172         # Replace the property with the accessor object. Inspired by:
    173         # http://www.pydanny.com/cached-property.html

/opt/anaconda/lib/python3.7/site-packages/pandas/core/indexes/accessors.py in __new__(cls, data)
    322             pass  # we raise an attribute error anyway
    323 
--> 324         raise AttributeError("Can only use .dt accessor with datetimelike "
    325                              "values")

AttributeError: Can only use .dt accessor with datetimelike values

Any way to fix it? Maybe convert something in the original data frame?

  • Can you print the output of res.show()? Commented Jul 21, 2019 at 7:34
  • @cs95 added the output to the original post. Thanks! Commented Jul 21, 2019 at 7:37
  • Hmm, can't see exactly what the issue could be. Perhaps you could try converting your date column to timestamp, then trying again: from pyspark.sql.functions import to_timestamp; res2 = res.withColumn('DATE', to_timestamp(res.DATE, 'yyyy-MM-dd')).toPandas() Commented Jul 21, 2019 at 7:41
  • @cs95 it worked! thanks! (I could accept that as an answer). Commented Jul 21, 2019 at 7:45

2 Answers


As a workaround, you may consider converting your date column to timestamp (this is more aligned with pandas' datetime type).

from pyspark.sql.functions import to_timestamp
res2 = res.withColumn('DATE', to_timestamp(res.DATE, 'yyyy-MM-dd')).toPandas()
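To confirm the conversion took, a quick check of the resulting pandas dtypes (a sketch; res2 as defined above) should show DATE arriving as datetime64[ns]:

print(res2.dtypes)
# Expected, given the schema above:
# DATE    datetime64[ns]
# ID              object
# A              float64
# B              float64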



I just tested the recommendation by @cs95, and it does work. However, in the output of toPandas I still see the hour:minute:second component, like

|2017-02-15 00:00:00|

This is because pandas takes the date value and fills in a midnight time, even though we don't want one; the 'yyyy-MM-dd' format passed to to_timestamp does not prevent it.
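As an aside, if the midnight times are a problem downstream, one option (a minimal sketch, assuming the DATE column did arrive in pandas as datetime64) is to truncate back to plain dates on the pandas side once toPandas has succeeded:

res2 = res.withColumn('DATE', res.DATE.cast('timestamp')).toPandas()
# .dt.date yields Python date objects (the column dtype becomes object)
res2['DATE'] = res2['DATE'].dt.date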

Back on the Spark side, I get the same result with somewhat simpler code:

col_name = "DATE"
res.withColumn(col_name, col(col_name).cast("timestamp"))

Here's why this is more workable. In the use case I face, there are many (many!) columns in the Spark DataFrame, and I need to find all columns of one type and convert them to another. For example, toPandas complains about Spark decimal columns and recommends conversion, and I don't want to run that one column at a time.

Here's a function. I have defaults set for the decimal case, but this approach works for any types to convert.

from pyspark.sql.functions import col

def spark_type_converter(sdf, x="decimal", y="float"):
    """This uses Spark cast to convert variables of type `x` to `y`.
    :param sdf: Spark data frame
    :param x: character string for the input type to cast FROM.
        This is matched against `dtype` with `startswith`,
        so only the first few characters matter.
    :param y: character string for the name of the type to cast INTO.
    """
    target_cols = [i for i,j in sdf.dtypes if j.startswith(x)]
    for col_name in target_cols:
        sdf = sdf.withColumn(col_name, col(col_name).cast(y)) 
    return sdf

Used here to find all date columns and change them to timestamp:

res = spark_type_converter(res, x="date", y="timestamp")

When toPandas runs, this "one size fits all" approach gives the same result.
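For example, a quick sanity check of the decimal default (a hypothetical toy frame; spark here is an active SparkSession):

from decimal import Decimal

# Hypothetical one-row frame with a decimal column
sdf = spark.createDataFrame([(Decimal("1.50"), "a")], "AMT decimal(10,2), ID string")
sdf = spark_type_converter(sdf)  # defaults: x="decimal", y="float"
print(sdf.dtypes)  # [('AMT', 'float'), ('ID', 'string')]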

