
Should adding a hint to a Spark query ever return different results? I'm able to reproduce a production issue with the code below. Note that the last statement is just a copy/pasted version of the prior statement with a hint added, yet it returns 0 records whereas the prior statement returns 1 record.

import spark.implicits._

// Left side: full 10-character geohashes
val geohashes = Seq(
  "9ykchgz95z",
  "abckd3kdf1"
).toDF("geohash")
.repartition(2)
geohashes.createOrReplaceTempView("geohashes")

// Right side: 9-character prefixes to keep
val geohashesToInclude = Seq(
  "9ykchgz91",
  "9ykchgz92",
  "9ykchgz93",
  "9ykchgz94",
  "9ykchgz95",
  "9ykchgz96",
  "9ykchgz97",
  "9ykchgz98",
  "9ykchgz99",
  "9ykchgz90"
).toDF("geohash_prefix")
.repartition(10)
geohashesToInclude.createOrReplaceTempView("geohashes_to_include")

spark.sql("SELECT * FROM geohashes g LEFT SEMI JOIN geohashes_to_include i ON SUBSTRING(g.geohash, 1, 7) = SUBSTRING(i.geohash_prefix, 1, 7) AND STARTSWITH(g.geohash, i.geohash_prefix)").show()
    
+----------+
|   geohash|
+----------+
|9ykchgz95z|
+----------+

spark.sql("SELECT /*+ SHUFFLE_HASH(i) */ * FROM geohashes g LEFT SEMI JOIN geohashes_to_include i ON SUBSTRING(g.geohash, 1, 7) = SUBSTRING(i.geohash_prefix, 1, 7) AND STARTSWITH(g.geohash, i.geohash_prefix)").show()

+-------+
|geohash|
+-------+
+-------+

I ran this sample code in Spark 3.5.3, but the issue was discovered in the latest version of EMR's Spark 3.5.5. I was also able to reproduce the issue in Spark 3.4.1.

1 Answer

I was able to reproduce this on Spark 3.5.6. This looks like a bug: a join strategy hint should only change the physical execution of the query, never its results. It is worth noting that SHUFFLE_HASH does not support purely non-equi join conditions (like STARTSWITH), but according to the code it should be allowed as long as there is at least one equi-join key, which you have, since your join condition compares the substrings for equality.
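
To confirm that the hint really is what switches the physical join strategy, you can print and compare both physical plans with Dataset.explain; a quick check, run in the same spark-shell session as the reproduction above:

// Print the physical plans for the unhinted and hinted queries so the join
// operators can be compared (e.g. BroadcastHashJoin / SortMergeJoin vs. ShuffledHashJoin)
val noHint = spark.sql("""
SELECT * FROM geohashes g 
LEFT SEMI JOIN geohashes_to_include i 
ON SUBSTRING(g.geohash, 1, 7) = SUBSTRING(i.geohash_prefix, 1, 7) AND STARTSWITH(g.geohash, i.geohash_prefix)
""")

val withHint = spark.sql("""
SELECT /*+ SHUFFLE_HASH(i) */ * FROM geohashes g 
LEFT SEMI JOIN geohashes_to_include i 
ON SUBSTRING(g.geohash, 1, 7) = SUBSTRING(i.geohash_prefix, 1, 7) AND STARTSWITH(g.geohash, i.geohash_prefix)
""")

// "formatted" mode shows the operator tree plus per-operator details
noHint.explain("formatted")
withHint.explain("formatted")

On the affected versions the hinted plan should show a ShuffledHashJoin for the left semi join, while the unhinted plan typically picks a broadcast or sort-merge join for these tiny tables; since the two plans should be equivalent, any difference in the returned rows points to a planner bug.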

It could be worth raising a bug ticket in the Apache Spark project (contributing guidelines, Apache Spark Jira).

However, if you just want a quick workaround that gives consistent query results, you can instead use a LEFT OUTER JOIN and then filter out the rows where the right side is null.

No join hint:

spark.sql("""
SELECT g.* 
FROM geohashes g 
LEFT JOIN geohashes_to_include i 
ON SUBSTRING(g.geohash, 1, 7) = SUBSTRING(i.geohash_prefix, 1, 7) AND STARTSWITH(g.geohash, i.geohash_prefix)
WHERE i.geohash_prefix IS NOT NULL
""")
.show()

+----------+
|   geohash|
+----------+
|9ykchgz95z|
+----------+

With join hint:

spark.sql("""
SELECT /*+ SHUFFLE_HASH(i) */ g.* 
FROM geohashes g 
LEFT JOIN geohashes_to_include i 
ON SUBSTRING(g.geohash, 1, 7) = SUBSTRING(i.geohash_prefix, 1, 7) AND STARTSWITH(g.geohash, i.geohash_prefix)
WHERE i.geohash_prefix IS NOT NULL
""")
.show()

+----------+
|   geohash|
+----------+
|9ykchgz95z|
+----------+
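
One caveat with this rewrite (not a problem for your sample data, where each geohash can match at most one prefix): a LEFT JOIN can duplicate a left-side row if it matches several prefixes, whereas a LEFT SEMI JOIN never does. If multiple matches are possible in production, adding DISTINCT restores the semi-join semantics; a minimal sketch against the same temp views:

spark.sql("""
SELECT DISTINCT g.* 
FROM geohashes g 
LEFT JOIN geohashes_to_include i 
ON SUBSTRING(g.geohash, 1, 7) = SUBSTRING(i.geohash_prefix, 1, 7) AND STARTSWITH(g.geohash, i.geohash_prefix)
WHERE i.geohash_prefix IS NOT NULL
""")
.show()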
