Should adding a hint to a spark query ever return different results? I'm able to reproduce a production issue with the code below. It's worth noting that the last statement is just a copy/pasted version of the prior statement with a hint added. However, the last statement returns 0 records whereas the prior statement returns 1 record.
import spark.implicits._
val geohashes = Seq(
("9ykchgz95z"),
("abckd3kdf1"),
).toDF("geohash")
.repartition(2)
geohashes.createOrReplaceTempView("geohashes")
val geohashesToInclude = Seq(
("9ykchgz91"),
("9ykchgz92"),
("9ykchgz93"),
("9ykchgz94"),
("9ykchgz95"),
("9ykchgz96"),
("9ykchgz97"),
("9ykchgz98"),
("9ykchgz99"),
("9ykchgz90"),
).toDF("geohash_prefix")
.repartition(10)
geohashesToInclude.createOrReplaceTempView("geohashes_to_include")
spark.sql("SELECT * FROM geohashes g LEFT SEMI JOIN geohashes_to_include i ON SUBSTRING(g.geohash, 1, 7) = SUBSTRING(i.geohash_prefix, 1, 7) AND STARTSWITH(g.geohash, i.geohash_prefix)").show()
+----------+
| geohash|
+----------+
|9ykchgz95z|
+----------+
spark.sql("SELECT /*+ SHUFFLE_HASH(i) */ * FROM geohashes g LEFT SEMI JOIN geohashes_to_include i ON SUBSTRING(g.geohash, 1, 7) = SUBSTRING(i.geohash_prefix, 1, 7) AND STARTSWITH(g.geohash, i.geohash_prefix)").show()
+-------+
|geohash|
+-------+
+-------+
I ran this sample code in spark 3.5.3, but the issue was discovered in the latest version of emr's spark 3.5.5. I was also able to reproduce the issue in spark 3.4.1