
Joining two dataframes is throwing a rather weird error in Scala Spark. I am trying to join two datasets, namely input and metric.

Snippet I am executing->

input.as("input").join(broadcast(metric.as("metric")), Seq("seller", "seller_seller_tag"), "left_outer")

The join breaks with ->

org.apache.spark.sql.AnalysisException: Resolved attribute(s) seller_seller_tag#2763 missing from seller#2483, seller_tag#2484, stats_date#2507, learner_id#2508, keyword_click#2493L, visible_impression#2494L, seller_tag_misc#439 in operator +- !Aggregate [seller#2483, seller_seller_tag#2763], [seller#2483, seller_seller_tag#2763, sum(visible_impression#2494L) AS visible_impression#527L, sum(keyword_click#2493L) AS keyword_click#529L]. Attribute(s) with the same name appear in the operation: seller_seller_tag. Please check if the right attribute(s) are used.;;

I have asserted that both dataframes have the seller_seller_tag column.

The weirdness->

This is the logical plan for metric https://dpaste.com/7NECV5DWX

This is the logical plan for input https://dpaste.com/DYSGLUWCY

And this is the logical plan for the join https://dpaste.com/AGH2BCN2G

In the plan for the join, the attribute ID for seller_seller_tag suddenly changes for the metric dataframe ->

+- !Aggregate [seller#2483, seller_seller_tag#2763], [seller#2483, seller_seller_tag#2763, sum(visible_impression#2494L) AS visible_impression#527L, sum(keyword_click#2493L) AS keyword_click#529L]
+- Project [seller#2483, seller_tag#2484, stats_date#2507, learner_id#2508, keyword_click#2493L, visible_impression#2494L, seller_tag_misc#439, concat(seller#2483, __, seller_tag_misc#439) AS seller_seller_tag#482]

while the same step in the plan for metric is ->

+- Aggregate [seller#2483, seller_seller_tag#482], [seller#2483, seller_seller_tag#482, sum(visible_impression#2494L) AS visible_impression#527L, sum(keyword_click#2493L) AS keyword_click#529L]
+- Project [seller#2483, seller_tag#2484, stats_date#2507, learner_id#2508, keyword_click#2493L, visible_impression#2494L, seller_tag_misc#439, concat(seller#2483, __, seller_tag_misc#439) AS seller_seller_tag#482]

The plan changes attribute IDs on join! Any explanation?


1 Answer


This is a bug in Spark 2.4.0 (and probably earlier versions as well; I'm not sure). Even though the SQL appears legitimate, an AnalysisException is thrown when the query contains multiple joins. This happens because of ResolveReferences.dedupRight, an internal method Spark uses while resolving plans. Thanks to mazaneicha for pointing this out. You can read more about the issue and reproduce it here: https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-32280

The issue has been fixed in later versions of Spark (2.4.7, 3.0.1, 3.1.0). https://issues.apache.org/jira/browse/SPARK-32280?attachmentSortBy=dateTime
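If you are stuck on an affected Spark version, a commonly suggested workaround for this kind of dedup/attribute-ID mis-resolution is to rebuild one side of the join from its RDD and schema, which forces the analyzer to assign fresh attribute IDs. This is a sketch under assumptions, not a tested fix for your exact query: the toy dataframes and column values below are hypothetical stand-ins for your input and metric.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object DedupWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dedup-workaround")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-ins for the question's dataframes.
    val input = Seq(("s1", "s1__tagA"))
      .toDF("seller", "seller_seller_tag")
    val metric = Seq(("s1", "s1__tagA", 5L))
      .toDF("seller", "seller_seller_tag", "visible_impression")

    // Workaround: recreate `metric` from its RDD + schema so the
    // analyzer assigns fresh attribute IDs before the join, sidestepping
    // the dedupRight bug on affected versions.
    val metricFresh = spark.createDataFrame(metric.rdd, metric.schema)

    val joined = input.as("input").join(
      broadcast(metricFresh.as("metric")),
      Seq("seller", "seller_seller_tag"),
      "left_outer")

    joined.show()
    spark.stop()
  }
}
```

Upgrading to a fixed version remains the cleaner solution; the rebuild above costs an extra analysis pass and loses any cached lineage on metric.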
