
functions.expr("[SQL]") can be used as an alternative way to express a query in many cases, for instance:

df2 = df.withColumn("gender", expr("CASE WHEN gender = 'M' THEN 'Male' " +
                                   "WHEN gender = 'F' THEN 'Female' ELSE 'Unknown' END"))

which is equivalent to

df2 = df.withColumn("gender", when(col("gender") == "M", "Male")
                             .when(col("gender") == "F", "Female")
                             .otherwise("Unknown"))

I am wondering: is there a performance difference?

And what about the following example, for which the functions API doesn't offer an out-of-the-box way to add hours?

df = df.withColumn('testing_time', df.testing_time + expr('INTERVAL 2 HOURS'))

VS

df = df.withColumn("testing_time", (unix_timestamp("testing_time") + 7200).cast('timestamp'))

Finally, do you suggest using functions.expr wherever it can be used?


1 Answer

  • is there a performance difference?

    No, both versions are identical in every respect, including performance.

    from pyspark.sql import functions as F
    df = spark.createDataFrame([("M",), ("F",)], ["gender"])
    
    df2 = df.withColumn("gender", F.when(F.col("gender") == "M", "Male")
                                   .when(F.col("gender") == "F", "Female")
                                   .otherwise("Unknown"))
    
    df3 = df.withColumn("gender", F.expr("CASE WHEN gender = 'M' THEN 'Male' " +
                                              "WHEN gender = 'F' THEN 'Female' ELSE 'Unknown' END"))
    

    PySpark code doesn't directly make Spark run the algorithm. It builds logical and physical plans, and it is the physical plan that actually runs. You can inspect and compare them - they are identical.

    df2.explain()
    # == Physical Plan ==
    # *(1) Project [CASE WHEN (gender#49 = M) THEN Male WHEN (gender#49 = F) THEN Female ELSE Unknown END AS gender#51]
    # +- *(1) Scan ExistingRDD[gender#49]
    
    df3.explain()
    # == Physical Plan ==
    # *(1) Project [CASE WHEN (gender#49 = M) THEN Male WHEN (gender#49 = F) THEN Female ELSE Unknown END AS gender#53]
    # +- *(1) Scan ExistingRDD[gender#49]
    
    df2.sameSemantics(df3)  # Available in Spark 3.1+
    # True
    
  • Regarding the use of expr, use it

    • when you don't have an equivalent in PySpark
    • when your Spark version doesn't yet support the PySpark equivalent
    • when a PySpark function expects a literal value, but you want to provide a column (e.g. this case; see the sketch below)

    Otherwise, it often looks cleaner when written in PySpark.
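
    A minimal sketch of that third bullet (the DataFrame, the column names and the add_months scenario are my own illustration, not from the question): in older Spark versions, F.add_months only accepted a plain Python int for the number of months, so supplying another column meant falling back to expr:

    from pyspark.sql import functions as F

    # Hypothetical data: a start date plus a per-row number of months to add.
    df = spark.createDataFrame(
        [("2021-01-15", 3), ("2021-06-30", 12)],
        "start_date string, n_months int",
    ).withColumn("start_date", F.to_date("start_date"))

    # The whole expression is parsed as SQL, so n_months can be a column.
    df = df.withColumn("end_date", F.expr("add_months(start_date, n_months)"))

    # Newer Spark releases also accept a Column for the second argument,
    # in which case the pure-PySpark form works as well:
    # df = df.withColumn("end_date", F.add_months("start_date", F.col("n_months")))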
