
I have the two datasets below:

controlSetDF : has columns loan_id, merchant_id, loan_type, created_date, as_of_date
accountDF : has columns merchant_id, id, name, status, merchant_risk_status

I am using the Java Spark API to join them, and I need only specific columns in the final dataset:

private String[] control_set_columns = {"loan_id", "merchant_id", "loan_type"};
private String[] sf_account_columns = {"id as account_id", "name as account_name", "merchant_risk_status"};

controlSetDF.selectExpr(control_set_columns)
  .join(accountDF.selectExpr(sf_account_columns),
        controlSetDF.col("merchant_id").equalTo(accountDF.col("merchant_id")),
        "left_outer");

But I get the error below:

org.apache.spark.sql.AnalysisException: resolved attribute(s) merchant_id#3L missing from account_name#131,loan_type#105,account_id#130,merchant_id#104L,loan_id#103,merchant_risk_status#2 in operator !Join LeftOuter, (merchant_id#104L = merchant_id#3L);;!Join LeftOuter, (merchant_id#104L = merchant_id#3L)

There seems to be an issue because both DataFrames have a merchant_id column.

NOTE: If I don't use .selectExpr(), it works fine, but the result then contains all columns from both datasets.

2 Answers


If the join columns have the same name in both DataFrames, you can pass the column name itself as the join condition instead of a column expression. In Scala this is a bit cleaner; in Java you need to convert a Java List to a Scala Seq:

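// Lists is Guava's com.google.common.collect.Lists; Seq is scala.collection.Seq
Seq<String> joinColumns = scala.collection.JavaConversions
  .asScalaBuffer(Lists.newArrayList("merchant_id"));

// merchant_id must be present on both sides, so it also has to be listed in sf_account_columns
controlSetDF.selectExpr(control_set_columns)
  .join(accountDF.selectExpr(sf_account_columns), joinColumns, "left_outer");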

This will result in a DataFrame with only one of the join columns.


You are joining with a DataFrame that contains only the columns listed in sf_account_columns. That array doesn't include the column you want to join on, so the projected DataFrame doesn't have it either. Add merchant_id to that array.

3 Comments

This works, but the final dataset will have a duplicate merchant_id column. How do I avoid that? I want the final dataset to show only the merchant_id from controlSetDF.
@NewQueries Give this column an alias and do a select after the join :)
Thanks @T. Gaweda. Yes, I would have done this eventually. Thanks for your responses. Using Seq<String> worked for me.
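
For reference, the alias-then-select suggestion from the comments could look roughly like the sketch below. It assumes the Spark 2.x+ Dataset<Row> API (the question may be using the older DataFrame class). The class name JoinSketch, the method name joinWithAlias, and the alias account_merchant_id are made up for illustration. The join condition refers to the projected DataFrames rather than the original ones, which avoids the "resolved attribute(s) missing" error, and the aliased key is dropped afterwards so that only the merchant_id from controlSetDF remains.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class JoinSketch {

    // Hypothetical helper: left-joins the two projections and keeps a single merchant_id.
    static Dataset<Row> joinWithAlias(Dataset<Row> controlSetDF, Dataset<Row> accountDF) {
        Dataset<Row> control =
            controlSetDF.selectExpr("loan_id", "merchant_id", "loan_type");

        // Keep the join key in the projection, under an alias (name chosen for this sketch).
        Dataset<Row> account = accountDF.selectExpr(
            "merchant_id as account_merchant_id",
            "id as account_id",
            "name as account_name",
            "merchant_risk_status");

        return control
            // Build the condition from the projected DataFrames, not the originals.
            .join(account,
                  control.col("merchant_id").equalTo(account.col("account_merchant_id")),
                  "left_outer")
            // Remove the aliased key so merchant_id appears only once.
            .drop("account_merchant_id");
    }
}
```

The result has the columns loan_id, merchant_id, loan_type, account_id, account_name, and merchant_risk_status. Compared with the Seq<String> variant, this keeps the join condition explicit, at the cost of one extra alias and a drop.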
