Joining two dataframes is throwing a rather weird error in Scala Spark. I am trying to join two data sets namely input and metric
Snippet I am executing->
input.as("input").join(broadcast(metric.as("metric"), Seq("seller","seller_seller_tag"), "left_outer"))
The join breaks with ->
org.apache.spark.sql.AnalysisException: Resolved attribute(s) seller_seller_tag#2763 missing from seller#2483, seller_tag#2484, stats_date#2507, learner_id#2508, keyword_click#2493L, visible_impression#2494L, seller_tag_misc#439 in operator +- !Aggregate [seller#2483, seller_seller_tag#2763], [seller#2483, seller_seller_tag#2763, sum(visible_impression#2494L) AS visible_impression#527L, sum(keyword_click#2493L) AS keyword_click#529L]. Attribute(s) with the same name appear in the operation: seller_seller_tag. Please check if the right attribute(s) are used.;;
I have asserted both the dataframes have the seller_seller_tag column.
The weirdness->
This is the logical plan for metric https://dpaste.com/7NECV5DWX
This is the logical plan for input https://dpaste.com/DYSGLUWCY
And this is the logical plan for the join https://dpaste.com/AGH2BCN2G
In the plan for join, the column #id for seller_seller_tag changes suddenly, for the metric dataframe ->
+- !Aggregate [seller#2483, seller_seller_tag#2763], [seller#2483, seller_seller_tag#2763, sum(visible_impression#2494L) AS visible_impression#527L, sum(keyword_click#2493L) AS keyword_click#529L]
+- Project [seller#2483, seller_tag#2484, stats_date#2507, learner_id#2508, keyword_click#2493L, visible_impression#2494L, seller_tag_misc#439, concat(seller#2483, __, seller_tag_misc#439) AS seller_seller_tag#482]
while the same step in the the plan for metric is->
+- Aggregate [seller#2483, seller_seller_tag#482], [seller#2483, seller_seller_tag#482, sum(visible_impression#2494L) AS visible_impression#527L, sum(keyword_click#2493L) AS keyword_click#529L]
+- Project [seller#2483, seller_tag#2484, stats_date#2507, learner_id#2508, keyword_click#2493L, visible_impression#2494L, seller_tag_misc#439, concat(seller#2483, __, seller_tag_misc#439) AS seller_seller_tag#482]
The plan changes ids on join! Explanation!?