I have to find the exact hour in which the most check-ins occur in the Yelp dataset, but I'm running up against the error below. Here is my code so far:

from pyspark.sql.functions import udf, explode
from pyspark.sql.types import ArrayType, StringType
from pyspark.sql import functions as F

checkin = spark.read.json('yelp_academic_dataset_checkin.json.gz')

# Split the comma-separated date string into an array, then explode to one row per check-in
datesplit = udf(lambda x: x.split(','), ArrayType(StringType()))
dates = checkin.select('business_id', datesplit('date').alias('dates')) \
               .withColumn('checkin_date', explode('dates'))
dates = dates.select("checkin_date")

# This is the line that raises the error below
dates.withColumn("checkin_date", F.date_trunc('checkin_date',
                   F.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss 'UTC'"))).show(truncate=0)

And the error:

Py4JJavaError: An error occurred while calling o1112.withColumn.
: org.apache.spark.sql.AnalysisException: cannot resolve '`timestamp`' given input columns: [checkin_date];;
'Project [date_trunc(checkin_date, to_timestamp('timestamp, Some(yyyy-MM-dd HH:mm:ss 'UTC')), Some(Etc/UTC)) AS checkin_date#190]
+- Project [checkin_date#176]
   +- Project [business_id#6, dates#172, checkin_date#176]
      +- Generate explode(dates#172), false, [checkin_date#176]
         +- Project [business_id#6, <lambda>(date#7) AS dates#172]
            +- Relation[business_id#6,date#7] json

dates is just a Spark DataFrame with a single column, "checkin_date", containing only datetime strings, so I'm not sure why this isn't working.

1 Answer

The error simply means that in the following line of code you are trying to access a column named timestamp, which does not exist.

dates.withColumn("checkin_date", F.date_trunc('checkin_date',
                   F.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss 'UTC'")))

Indeed, here is the signature of the to_timestamp function:

pyspark.sql.functions.to_timestamp(col, format=None)

The first argument is the column; the second is the format. I am assuming you are trying to parse a date and then truncate it. Let's say you want to truncate the date to the month level. The correct way to do it would be:

dates.withColumn("checkin_date", F.date_trunc('month',
                   F.to_timestamp('checkin_date', "yyyy-MM-dd HH:mm:ss 'UTC'")))
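
Since the goal in the question is the busiest hour rather than the month, the same pattern works with 'hour' as the truncation level. Here is a minimal sketch (dates_hour and checkin_hour are hypothetical names, and the format string assumes the strings carry the trailing UTC as in the answer above):

dates_hour = dates.withColumn(
    "checkin_hour",
    F.date_trunc('hour', F.to_timestamp('checkin_date', "yyyy-MM-dd HH:mm:ss 'UTC'"))
)
dates_hour.show(5, truncate=False)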

5 Comments

Hi, Oli. Thank you! When I run this to truncate the date to the hour level and look at a few rows, I get all nulls in every row.
That means that the date format is incorrect. Check that it matches the way your dates are formatted. You may add an example in your question if you need help. Your format should work with dates like this: "2021-11-04 17:00:34 UTC". If your strings do not contain UTC, simply remove it from the format.
Oh! That worked. I just had to remove UTC. Do you know how I might then select the top most frequent hour from the new dataframe?
I figured it out. Using: row1 = df1.agg({"x": "max"}).collect()[0]
Nice ;) Glad I could help!
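
For reference, here is one end-to-end way to get the most frequent hour. This is a minimal sketch with hypothetical names (hourly, top_hour); it assumes the format string drops the 'UTC' suffix as discussed in the comments, and trims the leading space that splitting on ',' leaves behind:

from pyspark.sql import functions as F

# Parse each check-in string (trimmed, since splitting on ',' leaves a leading space),
# extract the hour of day, and count check-ins per hour
hourly = (dates
          .withColumn("ts", F.to_timestamp(F.trim("checkin_date"), "yyyy-MM-dd HH:mm:ss"))
          .withColumn("hour", F.hour("ts"))
          .groupBy("hour")
          .count()
          .orderBy(F.desc("count")))

hourly.show(24)            # all 24 hours ranked by check-in volume
top_hour = hourly.first()  # Row(hour=..., count=...) for the busiest hour

If df1 in the comment above is already such a per-hour count table, agg({"count": "max"}) fetches the top count; the orderBy route also tells you which hour that count belongs to.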
