The most important optimization is to dispatch on the format with when instead of trying every format. This saves a significant amount of computation time. Beyond that, it becomes harder to find substantive improvements.
The to_date function is relatively slow, especially when a format string is provided. It is convenient and simple, but when performance matters, I would look for alternatives such as a plain cast or make_date.
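For illustration, here is a minimal sketch of the three approaches for the ISO format alone (the column name date_str matches the test data below; the relative speeds reflect the measurements further down):

from pyspark.sql import functions as F

# Three ways to parse "yyyy-MM-dd" strings into dates.
# Column slicing col[start:length] is 1-based and maps to substr().
parsed_cast = F.col("date_str").cast("date")           # plain cast (ISO-style input only)
parsed_make = F.make_date(F.col("date_str")[1:4],      # year
                          F.col("date_str")[6:2],      # month
                          F.col("date_str")[9:2])      # day
parsed_to_date = F.to_date("date_str", "yyyy-MM-dd")   # slowest of the three in the tests below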
The following are the results of some performance tests. The fastest variant here is not necessarily fastest in general: the differences are small, they vary every time this kind of test is run, benchmarking a distributed computation reliably is not easy, and the results will also depend on your data. What can clearly be seen is the conclusion from above: performance improves when a cheap when condition dispatches to a single parse, instead of coalesce attempting (and failing) a to_date parse for every format.
Dataframe:
import random

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes a local or already-running Spark session

# Generate rows of date strings in a random format:
# "yyyyMMdd", "yyyy-MM-dd" or "yyyy.MM.dd".
data = [
    (random.choice([
        str(random.randint(2020, 2023)) + str(random.randint(1, 12)).zfill(2) + str(random.randint(1, 28)).zfill(2),
        str(random.randint(2020, 2023)) + '-' + str(random.randint(1, 12)).zfill(2) + '-' + str(random.randint(1, 28)).zfill(2),
        str(random.randint(2020, 2023)) + '.' + str(random.randint(1, 12)).zfill(2) + '.' + str(random.randint(1, 28)).zfill(2),
    ]),)
    for _ in range(100000)
]
df = spark.createDataFrame(data, ["date_str"])
df = df.persist()  # cache, so the timings below do not re-generate the data
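A quick look at the generated data (purely illustrative, not part of the timings):

from pyspark.sql import functions as F

df.show(5, truncate=False)                        # sample of the mixed-format strings
df.groupBy(F.length("date_str")).count().show()   # lengths 8 ("yyyyMMdd") and 10 (delimited formats)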
Functions:
from pyspark.sql import functions as F

def parse_date_coalesce(col):
    # Try every format; to_date returns null when the format does not match.
    formats = ("yyyyMMdd", "yyyy.MM.dd", "yyyy-MM-dd")
    return F.coalesce(*(F.to_date(col, f) for f in formats))

def parse_date_when(col):
    # Dispatch on a cheap condition, then parse with the single matching format.
    return F.when(F.col(col).contains("-"), F.to_date(col, "yyyy-MM-dd")) \
        .when(F.col(col).contains("."), F.to_date(col, "yyyy.MM.dd")) \
        .otherwise(F.to_date(col, "yyyyMMdd"))

# In the variants below, Column slicing is 1-based and maps to substr(start, length),
# so F.col(col)[5:1] is the single character at position 5 (the delimiter, if any).

def parse_date_translate_eq(col):
    return F.when(F.col(col)[5:1] == '-', F.col(col).cast('date')) \
        .when(F.col(col)[5:1] == '.', F.translate(col, ".", "-").cast('date')) \
        .otherwise(F.make_date(F.col(col)[1:4], F.col(col)[5:2], F.col(col)[7:2]))

def parse_date_translate_contains(col):
    return F.when(F.col(col).contains("-"), F.col(col).cast('date')) \
        .when(F.col(col).contains("."), F.translate(col, ".", "-").cast('date')) \
        .otherwise(F.make_date(F.col(col)[1:4], F.col(col)[5:2], F.col(col)[7:2]))

def parse_date_make_date_eq(col):
    return F.when(F.col(col)[5:1] == '-', F.col(col).cast('date')) \
        .when(F.col(col)[5:1] == '.', F.make_date(F.col(col)[1:4], F.col(col)[6:2], F.col(col)[9:2])) \
        .otherwise(F.make_date(F.col(col)[1:4], F.col(col)[5:2], F.col(col)[7:2]))

def parse_date_make_date_contains(col):
    return F.when(F.col(col).contains("-"), F.col(col).cast('date')) \
        .when(F.col(col).contains("."), F.make_date(F.col(col)[1:4], F.col(col)[6:2], F.col(col)[9:2])) \
        .otherwise(F.make_date(F.col(col)[1:4], F.col(col)[5:2], F.col(col)[7:2]))
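Optionally, before timing, one can sanity-check that all variants agree on the parsed output (a minimal sketch, not part of the measurements):

parsers = [parse_date_coalesce, parse_date_when, parse_date_translate_eq,
           parse_date_translate_contains, parse_date_make_date_eq,
           parse_date_make_date_contains]
checked = df
for i, p in enumerate(parsers):
    checked = checked.withColumn(f"p{i}", p("date_str"))
# Count rows where any variant disagrees with the first one (expected: 0).
mismatches = checked.filter(
    " or ".join(f"p0 != p{i}" for i in range(1, len(parsers)))
).count()
print(mismatches)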
Results (the noop write format executes the whole plan without writing anything, so the timings measure only the computation):
%timeit -n 5 -r 5 df.withColumn("date_parsed", parse_date_coalesce('date_str')).write.format("noop").mode("overwrite").save()
# 1.15 s ± 141 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
%timeit -n 20 -r 20 df.withColumn("date_parsed", parse_date_when('date_str')).write.format("noop").mode("overwrite").save()
# 145 ms ± 25.7 ms per loop (mean ± std. dev. of 20 runs, 20 loops each)
%timeit -n 20 -r 20 df.withColumn("date_parsed", parse_date_translate_eq('date_str')).write.format("noop").mode("overwrite").save()
# 112 ms ± 19.6 ms per loop (mean ± std. dev. of 20 runs, 20 loops each)
%timeit -n 20 -r 20 df.withColumn("date_parsed", parse_date_translate_contains('date_str')).write.format("noop").mode("overwrite").save()
# 110 ms ± 18 ms per loop (mean ± std. dev. of 20 runs, 20 loops each)
%timeit -n 20 -r 20 df.withColumn("date_parsed", parse_date_make_date_eq('date_str')).write.format("noop").mode("overwrite").save()
# 106 ms ± 15.3 ms per loop (mean ± std. dev. of 20 runs, 20 loops each)
%timeit -n 20 -r 20 df.withColumn("date_parsed", parse_date_make_date_contains('date_str')).write.format("noop").mode("overwrite").save()
# 101 ms ± 15.2 ms per loop (mean ± std. dev. of 20 runs, 20 loops each)
IMHO, you can shave off some time, but no more than 20-30% of what you currently have, and that is only the date-parsing part. Your production script probably has other stages, not to mention cluster/Spark application startup and reading and writing the data, which all contribute to the total processing time. So the small boost from date parsing may well become invisible.