
Should adding a hint to a Spark query ever return different results? I'm able to reproduce a production issue with the code below. Note that the last statement is just a copy/pasted version of the prior statement with a hint added, yet it returns 0 records whereas the prior statement returns 1 record.

import spark.implicits._

// Left side: full 10-character geohashes
val geohashes = Seq(
  "9ykchgz95z",
  "abckd3kdf1"
).toDF("geohash")
.repartition(2)
geohashes.createOrReplaceTempView("geohashes")

// Right side: 9-character prefixes to keep
val geohashesToInclude = Seq(
  "9ykchgz91",
  "9ykchgz92",
  "9ykchgz93",
  "9ykchgz94",
  "9ykchgz95",
  "9ykchgz96",
  "9ykchgz97",
  "9ykchgz98",
  "9ykchgz99",
  "9ykchgz90"
).toDF("geohash_prefix")
.repartition(10)
geohashesToInclude.createOrReplaceTempView("geohashes_to_include")

spark.sql("SELECT * FROM geohashes g LEFT SEMI JOIN geohashes_to_include i ON SUBSTRING(g.geohash, 1, 7) = SUBSTRING(i.geohash_prefix, 1, 7) AND STARTSWITH(g.geohash, i.geohash_prefix)").show()
    
+----------+
|   geohash|
+----------+
|9ykchgz95z|
+----------+

spark.sql("SELECT /*+ SHUFFLE_HASH(i) */ * FROM geohashes g LEFT SEMI JOIN geohashes_to_include i ON SUBSTRING(g.geohash, 1, 7) = SUBSTRING(i.geohash_prefix, 1, 7) AND STARTSWITH(g.geohash, i.geohash_prefix)").show()

+-------+
|geohash|
+-------+
+-------+

I ran this sample code in Spark 3.5.3, but the issue was discovered in the latest version of EMR's Spark 3.5.5. I was also able to reproduce the issue in Spark 3.4.1.

1 Answer

I was able to reproduce this on Spark 3.5.6. This looks like a bug: a join strategy hint should only change the physical execution of the query, never its results. It is worth noting that SHUFFLE_HASH does not support purely non-equi join conditions (like STARTSWITH), but according to the code it should be allowed as long as there is at least one equi-join key, which you have, since your join condition compares the substrings for equality.
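
To confirm that the hint really is what switches the physical join strategy, you can print and compare both physical plans with Dataset.explain; a quick check, run in the same spark-shell session as the reproduction above:

// Print the physical plans for the unhinted and hinted queries so the join
// operators can be compared (e.g. BroadcastHashJoin / SortMergeJoin vs. ShuffledHashJoin)
val noHint = spark.sql("""
SELECT * FROM geohashes g 
LEFT SEMI JOIN geohashes_to_include i 
ON SUBSTRING(g.geohash, 1, 7) = SUBSTRING(i.geohash_prefix, 1, 7) AND STARTSWITH(g.geohash, i.geohash_prefix)
""")

val withHint = spark.sql("""
SELECT /*+ SHUFFLE_HASH(i) */ * FROM geohashes g 
LEFT SEMI JOIN geohashes_to_include i 
ON SUBSTRING(g.geohash, 1, 7) = SUBSTRING(i.geohash_prefix, 1, 7) AND STARTSWITH(g.geohash, i.geohash_prefix)
""")

// "formatted" mode shows the operator tree plus per-operator details
noHint.explain("formatted")
withHint.explain("formatted")

On the affected versions the hinted plan should show a ShuffledHashJoin for the left semi join, while the unhinted plan typically picks a broadcast or sort-merge join for these tiny tables; since the two plans should be equivalent, any difference in the returned rows points to a planner bug.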

It could be worth raising a bug ticket in the Apache Spark project (contributing guidelines, Apache Spark Jira).

However, if you just want a quick workaround that gives consistent query results, you can instead use a LEFT OUTER JOIN and then filter out the rows where the right side is null.

No join hint:

spark.sql("""
SELECT g.* 
FROM geohashes g 
LEFT JOIN geohashes_to_include i 
ON SUBSTRING(g.geohash, 1, 7) = SUBSTRING(i.geohash_prefix, 1, 7) AND STARTSWITH(g.geohash, i.geohash_prefix)
WHERE i.geohash_prefix IS NOT NULL
""")
.show()

+----------+
|   geohash|
+----------+
|9ykchgz95z|
+----------+

With join hint:

spark.sql("""
SELECT /*+ SHUFFLE_HASH(i) */ g.* 
FROM geohashes g 
LEFT JOIN geohashes_to_include i 
ON SUBSTRING(g.geohash, 1, 7) = SUBSTRING(i.geohash_prefix, 1, 7) AND STARTSWITH(g.geohash, i.geohash_prefix)
WHERE i.geohash_prefix IS NOT NULL
""")
.show()

+----------+
|   geohash|
+----------+
|9ykchgz95z|
+----------+
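
One caveat with this rewrite (not a problem for your sample data, where each geohash can match at most one prefix): a LEFT JOIN can duplicate a left-side row if it matches several prefixes, whereas a LEFT SEMI JOIN never does. If multiple matches are possible in production, adding DISTINCT restores the semi-join semantics; a minimal sketch against the same temp views:

spark.sql("""
SELECT DISTINCT g.* 
FROM geohashes g 
LEFT JOIN geohashes_to_include i 
ON SUBSTRING(g.geohash, 1, 7) = SUBSTRING(i.geohash_prefix, 1, 7) AND STARTSWITH(g.geohash, i.geohash_prefix)
WHERE i.geohash_prefix IS NOT NULL
""")
.show()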
