I have to find the exact hour in which the most check-ins occur in the Yelp dataset, but I'm running up against the error below. Here is my code so far:

from pyspark.sql.functions import udf, explode
from pyspark.sql.types import ArrayType, StringType
from pyspark.sql import functions as F

checkin = spark.read.json('yelp_academic_dataset_checkin.json.gz')

# Split the comma-separated date string into an array, then explode to one row per check-in
datesplit = udf(lambda x: x.split(','), ArrayType(StringType()))
dates = checkin.select('business_id', datesplit('date').alias('dates')) \
               .withColumn('checkin_date', explode('dates'))
dates = dates.select("checkin_date")

# This is the line that raises the error below
dates.withColumn("checkin_date", F.date_trunc('checkin_date',
                   F.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss 'UTC'"))).show(truncate=0)

And the error:

Py4JJavaError: An error occurred while calling o1112.withColumn.
: org.apache.spark.sql.AnalysisException: cannot resolve '`timestamp`' given input columns: [checkin_date];;
'Project [date_trunc(checkin_date, to_timestamp('timestamp, Some(yyyy-MM-dd HH:mm:ss 'UTC')), Some(Etc/UTC)) AS checkin_date#190]
+- Project [checkin_date#176]
   +- Project [business_id#6, dates#172, checkin_date#176]
      +- Generate explode(dates#172), false, [checkin_date#176]
         +- Project [business_id#6, <lambda>(date#7) AS dates#172]
            +- Relation[business_id#6,date#7] json

dates is just a Spark DataFrame with a single column, "checkin_date", containing only datetime strings, so I'm not sure why this isn't working.

1 Answer

The error simply means that in the following line of code you are trying to access a column named timestamp, which does not exist.

dates.withColumn("checkin_date", F.date_trunc('checkin_date',
                   F.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss 'UTC'")))

Indeed, here is the signature of the to_timestamp function:

pyspark.sql.functions.to_timestamp(col, format=None)

The first argument is the column; the second is the format. I am assuming you are trying to parse a date and then truncate it. Let's say you want to truncate the date to the month level. The correct way to do it would be:

dates.withColumn("checkin_date", F.date_trunc('month',
                   F.to_timestamp('checkin_date', "yyyy-MM-dd HH:mm:ss 'UTC'")))
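
Since the goal in the question is the busiest hour rather than the month, the same pattern works with 'hour' as the truncation level. Here is a minimal sketch (dates_hour and checkin_hour are hypothetical names, and the format string assumes the strings carry the trailing UTC as in the answer above):

dates_hour = dates.withColumn(
    "checkin_hour",
    F.date_trunc('hour', F.to_timestamp('checkin_date', "yyyy-MM-dd HH:mm:ss 'UTC'"))
)
dates_hour.show(5, truncate=False)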

5 Comments

Hi, Oli. Thank you! When I run this to truncate the date to the hour level and look at a few rows, I get all nulls in every row.
That means that the date format is incorrect. Check that it matches the way your dates are formatted. You may add an example in your question if you need help. Your format should work with dates like this: "2021-11-04 17:00:34 UTC". If your strings do not contain UTC, simply remove it from the format.
Oh! That worked. I just had to remove UTC. Do you know how I might then select the top most frequent hour from the new dataframe?
I figured it out. Using: row1 = df1.agg({"x": "max"}).collect()[0]
Nice ;) Glad I could help!
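
For reference, here is one end-to-end way to get the most frequent hour. This is a minimal sketch with hypothetical names (hourly, top_hour); it assumes the format string drops the 'UTC' suffix as discussed in the comments, and trims the leading space that splitting on ',' leaves behind:

from pyspark.sql import functions as F

# Parse each check-in string (trimmed, since splitting on ',' leaves a leading space),
# extract the hour of day, and count check-ins per hour
hourly = (dates
          .withColumn("ts", F.to_timestamp(F.trim("checkin_date"), "yyyy-MM-dd HH:mm:ss"))
          .withColumn("hour", F.hour("ts"))
          .groupBy("hour")
          .count()
          .orderBy(F.desc("count")))

hourly.show(24)            # all 24 hours ranked by check-in volume
top_hour = hourly.first()  # Row(hour=..., count=...) for the busiest hour

If df1 in the comment above is already such a per-hour count table, agg({"count": "max"}) fetches the top count; the orderBy route also tells you which hour that count belongs to.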
