
I need to populate an empty column in DataFrame2 with values available in DataFrame1; essentially, I am updating a column in DataFrame2.

Both DataFrames have 2 common columns.

Is there a way to do this using Java, or is there a different approach?

Sample Input :

1) File1.csv

BILL_ID,BILL_NBR_TYPE_CD,BILL_NBR,VERSION,PRIM_SW
0501841898,BIN     ,404154,1000,Y
0681220958,BIN     ,735332,1000,Y
5992410180,BIN     ,454680,1000,Y
6995270884,SREBIN  ,1000252750295575,1000,Y

Here BILL_ID is system id and BILL_NBR is external id.

2) File2.csv

TXN_ID,TXN_TYPE,BILL_ID,BILL_NBR_TYPE_CD,BILL_NBR
01234, ABC     ,"     ",BIN     ,404154
22365, XYZ     ,"     ",BIN     ,735332
45890, LKJ     ,"     ",BIN     ,454680
23456, MPK     ,"     ",SREBIN  ,1000252750295575

Sample Output

As shown below, the BILL_ID values should be populated in File2.csv:

01234, ABC     ,501841898,BIN     ,404154
22365, XYZ     ,681220958,BIN     ,735332
45890, LKJ     ,5992410180,BIN     ,454680
23456, MPK     ,6995270884,SREBIN  ,1000252750295575

I have created two DataFrames and loaded both file's data into it, now I am not sure how to proceed.

EDIT

Basically I want clarity on the following three steps:

  1. How do I get the BILL_NBR and BILL_NBR_TYPE_CD values from File2.csv?

For this step I have written: file2Df.select("BILL_NBR_TYPE_CD","BILL_NBR");

  2. How do I get the BILL_ID values from File1.csv based on the values retrieved in step 1?

  3. How do I update the BILL_ID values accordingly in File2.csv?

I am new to Spark and would appreciate it if someone could give me some pointers.
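Outside Spark, the update in the three steps above amounts to a keyed lookup: build a map from the two shared columns to BILL_ID, then fill the blank column row by row. A minimal plain-Java sketch using the sample rows from the question (the class name and the `|`-delimited key format are illustrative; leading zeros survive because everything stays a string):

```java
import java.util.HashMap;
import java.util.Map;

public class BillIdLookup {
    // File1.csv sample rows: (BILL_NBR, BILL_NBR_TYPE_CD) -> BILL_ID
    static final Map<String, String> BILL_IDS = new HashMap<>();
    static {
        BILL_IDS.put("404154|BIN", "0501841898");
        BILL_IDS.put("735332|BIN", "0681220958");
        BILL_IDS.put("454680|BIN", "5992410180");
        BILL_IDS.put("1000252750295575|SREBIN", "6995270884");
    }

    // Look up the system id by the composite key of the two shared columns
    static String lookup(String billNbr, String typeCd) {
        return BILL_IDS.getOrDefault(billNbr.trim() + "|" + typeCd.trim(), "");
    }

    public static void main(String[] args) {
        // File2.csv rows: TXN_ID, TXN_TYPE, BILL_NBR_TYPE_CD, BILL_NBR (BILL_ID blank)
        String[][] txns = {
            {"01234", "ABC", "BIN", "404154"},
            {"22365", "XYZ", "BIN", "735332"},
            {"45890", "LKJ", "BIN", "454680"},
            {"23456", "MPK", "SREBIN", "1000252750295575"}
        };
        for (String[] t : txns) {
            // Emit the File2.csv layout with BILL_ID filled in
            System.out.println(t[0] + "," + t[1] + "," + lookup(t[3], t[2])
                    + "," + t[2] + "," + t[3]);
        }
    }
}
```

In Spark, the same composite-key lookup is expressed as a join on the two columns, which the answer below sketches.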

  • This is a simple SQL join problem. Do an inner join between df1 and df2, then select the columns appropriately from either df1 or df2. Commented Apr 23, 2018 at 13:42
  • Duplicate: stackoverflow.com/questions/43033835/… Commented Apr 23, 2018 at 13:45
  • @philantrovert Thank you for pointing that out, but can an inner join be performed on two columns? I was checking the API for this. Also, where will the BILL_ID column, which is empty in File2, end up? Commented Apr 24, 2018 at 6:57
  • @philantrovert I have tried Dataset <Row> joined = txnDf.join(accountDf,txnDf.col("BILL_NBR").equalTo(accountDf.col("BILL_NBR")).and(txnDf.col("BILL_NBR_TYPE_CD").equalTo(accountDf.col("BILL_NBR_TYPE_CD"))),"inner"); as per your suggestion but got this error : Exception in thread "main" org.apache.spark.sql.AnalysisException: Found duplicate column(s) when inserting allation and books/Output Files/Transformed23Apr.csv: bill_nbr, bill_id, bill_nbr_type_cd; Commented Apr 24, 2018 at 7:23
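The AnalysisException in the last comment comes from the joined result carrying two copies of BILL_NBR and BILL_NBR_TYPE_CD (one from each input) at write time. One way around it, sketched here assuming the dataframes are named txnDf and accountDf as in the comment, is to drop the blank BILL_ID column and join on a Seq of column names ("using columns" join), so each key column appears only once:

```java
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import scala.collection.JavaConverters;
import scala.collection.Seq;

// Drop the blank BILL_ID, then join on the column names themselves so that
// BILL_NBR and BILL_NBR_TYPE_CD each appear only once in the result
Seq<String> keys = JavaConverters.asScalaBuffer(
        Arrays.asList("BILL_NBR", "BILL_NBR_TYPE_CD")).toSeq();
Dataset<Row> joined = txnDf.drop("BILL_ID").join(accountDf, keys, "inner");
```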

1 Answer


You need to join the two tables on the BILL_NBR and BILL_NBR_TYPE_CD columns.

Assumption: there is a one-to-one relation between the (BILL_NBR, BILL_NBR_TYPE_CD) pair and the BILL_ID column.

Assuming that your DataFrames for File1.csv and File2.csv are named file1DF and file2DF respectively, the following should work for you:

Dataset<Row> billsDF = file1DF.select("BILL_ID", "BILL_NBR", "BILL_NBR_TYPE_CD");
Dataset<Row> txnsDF = file2DF.select("TXN_ID", "TXN_TYPE", "BILL_NBR_TYPE_CD", "BILL_NBR");
// Join on both shared columns; passing them as a Seq of column names
// keeps a single copy of each key column in the result
Seq<String> joinCols = JavaConverters.asScalaBuffer(
        Arrays.asList("BILL_NBR", "BILL_NBR_TYPE_CD")).toSeq();
Dataset<Row> joinedDF = txnsDF.join(billsDF, joinCols);

Note: I don't have the resources to test the above code by running it. Please let me know if you face any compile-time or runtime errors.
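Once the joined result looks right (here assumed to be in a dataframe named joinedDF), it can be reordered to File2.csv's column layout and written back out; the output path below is hypothetical:

```java
// Reorder to File2.csv's layout: TXN_ID, TXN_TYPE, BILL_ID, BILL_NBR_TYPE_CD, BILL_NBR
joinedDF.select("TXN_ID", "TXN_TYPE", "BILL_ID", "BILL_NBR_TYPE_CD", "BILL_NBR")
        .write()
        .option("header", "true")
        .mode("overwrite")
        .csv("output/transformed");  // hypothetical output directory
```

Note that Spark writes a directory of part files rather than a single CSV file.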


6 Comments

there is one to one relation between BILL_NBR,BILL_NBR_TYPE_CD to BILL_ID, so join should be done based on those two columns right? can you update the code?
yeah add the other column also and it should work. Have you tried that?
Updated the code. Not sure about correctness of the syntax.
do a crossJoin() and store it back on some condition
@vatsalmevada in the last line you have written file1DF("BILL_NBR","BILL_NBR_TYPE_CD") which gives compiler error as there is no such function defined, do you meant to use select there?
