
Joining two dataframes is throwing a rather weird error in Scala Spark. I am trying to join two datasets, namely input and metric.

Snippet I am executing->

input.as("input").join(broadcast(metric.as("metric")), Seq("seller", "seller_seller_tag"), "left_outer")

The join breaks with ->

org.apache.spark.sql.AnalysisException: Resolved attribute(s) seller_seller_tag#2763 missing from seller#2483, seller_tag#2484, stats_date#2507, learner_id#2508, keyword_click#2493L, visible_impression#2494L, seller_tag_misc#439 in operator +- !Aggregate [seller#2483, seller_seller_tag#2763], [seller#2483, seller_seller_tag#2763, sum(visible_impression#2494L) AS visible_impression#527L, sum(keyword_click#2493L) AS keyword_click#529L]. Attribute(s) with the same name appear in the operation: seller_seller_tag. Please check if the right attribute(s) are used.;;

I have asserted that both dataframes have the seller_seller_tag column.

The weirdness->

This is the logical plan for metric https://dpaste.com/7NECV5DWX

This is the logical plan for input https://dpaste.com/DYSGLUWCY

And this is the logical plan for the join https://dpaste.com/AGH2BCN2G

In the plan for the join, the attribute ID for seller_seller_tag suddenly changes for the metric dataframe ->

+- !Aggregate [seller#2483, seller_seller_tag#2763], [seller#2483, seller_seller_tag#2763, sum(visible_impression#2494L) AS visible_impression#527L, sum(keyword_click#2493L) AS keyword_click#529L]
+- Project [seller#2483, seller_tag#2484, stats_date#2507, learner_id#2508, keyword_click#2493L, visible_impression#2494L, seller_tag_misc#439, concat(seller#2483, __, seller_tag_misc#439) AS seller_seller_tag#482]

while the same step in the plan for metric is ->

+- Aggregate [seller#2483, seller_seller_tag#482], [seller#2483, seller_seller_tag#482, sum(visible_impression#2494L) AS visible_impression#527L, sum(keyword_click#2493L) AS keyword_click#529L]
+- Project [seller#2483, seller_tag#2484, stats_date#2507, learner_id#2508, keyword_click#2493L, visible_impression#2494L, seller_tag_misc#439, concat(seller#2483, __, seller_tag_misc#439) AS seller_seller_tag#482]

The plan changes attribute IDs on join! Any explanation?


1 Answer


This is a bug in Spark 2.4.0 (and probably earlier versions as well; I'm not sure). Even though the SQL appears legitimate, an AnalysisException is thrown when the query contains multiple joins. This happens because of ResolveReferences.dedupRight, an internal method Spark uses while resolving plans. Thanks to mazaneicha for pointing this out. You can read more about the issue and reproduce it here: https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-32280

The issue has been fixed in later versions of Spark (2.4.7, 3.0.1, 3.1.0). https://issues.apache.org/jira/browse/SPARK-32280?attachmentSortBy=dateTime
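If you are stuck on an affected Spark version, a commonly suggested workaround for this kind of dedup/attribute-ID mis-resolution is to rebuild one side of the join from its RDD and schema, which forces the analyzer to assign fresh attribute IDs. This is a sketch under assumptions, not a tested fix for your exact query: the toy dataframes and column values below are hypothetical stand-ins for your input and metric.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object DedupWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dedup-workaround")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-ins for the question's dataframes.
    val input = Seq(("s1", "s1__tagA"))
      .toDF("seller", "seller_seller_tag")
    val metric = Seq(("s1", "s1__tagA", 5L))
      .toDF("seller", "seller_seller_tag", "visible_impression")

    // Workaround: recreate `metric` from its RDD + schema so the
    // analyzer assigns fresh attribute IDs before the join, sidestepping
    // the dedupRight bug on affected versions.
    val metricFresh = spark.createDataFrame(metric.rdd, metric.schema)

    val joined = input.as("input").join(
      broadcast(metricFresh.as("metric")),
      Seq("seller", "seller_seller_tag"),
      "left_outer")

    joined.show()
    spark.stop()
  }
}
```

Upgrading to a fixed version remains the cleaner solution; the rebuild above costs an extra analysis pass and loses any cached lineage on metric.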
