Spark-SQL Joining two dataframes/ datasets with same column name

Question

I have the following two datasets:

controlSetDF : has columns loan_id, merchant_id, loan_type, created_date, as_of_date
accountDF : has columns merchant_id, id, name, status, merchant_risk_status

I am using the Java Spark API to join them, and I need only specific columns in the final dataset:

private String[] control_set_columns = {"loan_id", "merchant_id", "loan_type"};
private String[] sf_account_columns = {"id as account_id", "name as account_name", "merchant_risk_status"};

controlSetDF.selectExpr(control_set_columns)
    .join(accountDF.selectExpr(sf_account_columns),
          controlSetDF.col("merchant_id").equalTo(accountDF.col("merchant_id")),
          "left_outer");

But I get the following error:

org.apache.spark.sql.AnalysisException: resolved attribute(s) merchant_id#3L missing from account_name#131,loan_type#105,account_id#130,merchant_id#104L,loan_id#103,merchant_risk_status#2 in operator !Join LeftOuter, (merchant_id#104L = merchant_id#3L);;!Join LeftOuter, (merchant_id#104L = merchant_id#3L)

There seems to be an issue because both dataframes have merchant_id column.

NOTE: If I don't use .selectExpr(), it works fine, but then the result shows all columns from both datasets.

Silvio · Accepted Answer · 2017-04-20 04:33:41Z

2

If the join columns have the same name in both DataFrames, you can simply pass the column names as the join condition. In Scala this is a bit cleaner; in Java you need to convert a Java List to a Scala Seq:

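import com.google.common.collect.Lists;  // Guava
import scala.collection.Seq;

Seq<String> joinColumns = scala.collection.JavaConversions
  .asScalaBuffer(Lists.newArrayList("merchant_id"));

// Both sides must still contain merchant_id after selectExpr for this join to resolve.
controlSetDF.selectExpr(control_set_columns)
  .join(accountDF.selectExpr(sf_account_columns), joinColumns, "left_outer");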

This will result in a DataFrame with only one of the join columns.
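As a fuller illustration of the join above, here is a minimal sketch, assuming Spark 2.x or later built for Scala 2.12 is on the classpath. The class name, the helper methods, and the sample rows are made up for the example. It also assumes that merchant_id is added to the account column list, since a join on column names needs the column on both sides. It uses scala.collection.JavaConverters rather than JavaConversions, which is deprecated in newer Scala versions.

```java
import java.util.Collections;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import scala.collection.JavaConverters;

// Hypothetical example class; names and data are illustrative only.
class UsingColumnsJoinExample {

    // Invented sample rows with the columns described in the question.
    static Dataset<Row> sampleControlSet(SparkSession spark) {
        return spark.sql(
            "SELECT * FROM VALUES "
          + "(101L, 1L, 'term', DATE '2017-04-01', DATE '2017-04-19'), "
          + "(102L, 2L, 'line', DATE '2017-04-02', DATE '2017-04-19') "
          + "AS t(loan_id, merchant_id, loan_type, created_date, as_of_date)");
    }

    static Dataset<Row> sampleAccounts(SparkSession spark) {
        return spark.sql(
            "SELECT * FROM VALUES (1L, 'A-1', 'Acme', 'active', 'low') "
          + "AS t(merchant_id, id, name, status, merchant_risk_status)");
    }

    static Dataset<Row> joinOnMerchantId(Dataset<Row> controlSetDF, Dataset<Row> accountDF) {
        String[] controlSetColumns = {"loan_id", "merchant_id", "loan_type"};
        // Assumption: merchant_id is kept on the account side so the join can find it.
        String[] sfAccountColumns =
            {"merchant_id", "id as account_id", "name as account_name", "merchant_risk_status"};

        // Joining on a Seq of column names keeps a single merchant_id column in the result.
        return controlSetDF.selectExpr(controlSetColumns).join(
            accountDF.selectExpr(sfAccountColumns),
            JavaConverters.asScalaBufferConverter(Collections.singletonList("merchant_id"))
                .asScala().toSeq(),
            "left_outer");
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .master("local[1]")
            .appName("using-columns-join")
            .getOrCreate();
        try {
            joinOnMerchantId(sampleControlSet(spark), sampleAccounts(spark)).show();
        } finally {
            spark.stop();
        }
    }
}
```

With the sample data, the loan whose merchant has no matching account is still kept by the left outer join, with nulls in the account columns.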


T. Gawęda · Accepted Answer · 2017-04-19 22:05:48Z

1

You are joining with a DataFrame that contains only the columns listed in sf_account_columns. That array doesn't contain the column you want to join on, so the resulting DataFrame doesn't have it either. Add merchant_id to that array.


3 Comments

NewQueries Over a year ago

This works, but the final dataset will contain the merchant_id column twice. How do I avoid that? I want the final dataset to show only merchant_id from controlSetDF.

T. Gawęda Over a year ago

@NewQueries Give this column an alias and do a select after the join :)

NewQueries Over a year ago

Thanks @T. Gawęda. Yes, I would have done this eventually. Thanks for your responses. Using Seq<String> worked for me.
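The alias-then-select suggestion from the comments could look roughly like the sketch below. It assumes Spark 2.x or later on the classpath. The class name, the alias account_merchant_id, and the sample rows are invented for illustration. Both sides of the join condition are resolved against the projected DataFrames that are actually joined, which avoids the "resolved attribute(s) ... missing" error from the question.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical example class; names and data are illustrative only.
class AliasThenSelectExample {

    static Dataset<Row> joinAndSelect(Dataset<Row> controlSetDF, Dataset<Row> accountDF) {
        Dataset<Row> controls = controlSetDF.selectExpr("loan_id", "merchant_id", "loan_type");
        // Keep the join key on the account side, under a distinct (made-up) alias.
        Dataset<Row> accounts = accountDF.selectExpr(
            "merchant_id as account_merchant_id",
            "id as account_id",
            "name as account_name",
            "merchant_risk_status");

        return controls
            .join(accounts,
                  controls.col("merchant_id").equalTo(accounts.col("account_merchant_id")),
                  "left_outer")
            // Select after the join so only controlSetDF's merchant_id remains.
            .select("loan_id", "merchant_id", "loan_type",
                    "account_id", "account_name", "merchant_risk_status");
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .master("local[1]")
            .appName("alias-then-select")
            .getOrCreate();
        try {
            // Invented sample rows with the columns described in the question.
            Dataset<Row> controlSetDF = spark.sql(
                "SELECT * FROM VALUES "
              + "(101L, 1L, 'term', DATE '2017-04-01', DATE '2017-04-19'), "
              + "(102L, 2L, 'line', DATE '2017-04-02', DATE '2017-04-19') "
              + "AS t(loan_id, merchant_id, loan_type, created_date, as_of_date)");
            Dataset<Row> accountDF = spark.sql(
                "SELECT * FROM VALUES (1L, 'A-1', 'Acme', 'active', 'low') "
              + "AS t(merchant_id, id, name, status, merchant_risk_status)");

            joinAndSelect(controlSetDF, accountDF).show();
        } finally {
            spark.stop();
        }
    }
}
```

Compared with joining on a Seq of column names, this form is more verbose but also works when the key columns have different names in the two datasets.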
