functions.expr("[SQL]") can be used as an alternative way to write a query in many cases, for instance:
df2=df.withColumn("gender", expr("CASE WHEN gender = 'M' THEN 'Male' " +
"WHEN gender = 'F' THEN 'Female' ELSE 'unknown' END"))
which is equivalent to
df2=df.withColumn("gender", when(col("gender") == "M", "Male")
.when(col("gender") == "F", "Female")
.otherwise("unknown"))
Is there a performance difference between the two?
And what about the following example, where the functions API has no out-of-the-box way to add hours?
df = df.withColumn('testing_time', df.testing_time + expr('INTERVAL 2 HOURS'))
VS
df = df.withColumn("testing_time", (unix_timestamp("testing_time") + 7200).cast('timestamp'))
Finally, would you recommend using functions.expr wherever possible?