
I am trying to do a POC for my current project, where we want to check whether Spark can be used.

The current system has batch processes which take data from tables and modify them based on batch code.

I am new to Apache Spark. As part of the POC, I am loading a CSV file into a DataFrame using:

    Dataset<Row> df = sparkSession.read()
            .format("csv")
            .option("header", true)
            .option("inferSchema", true)
            .load("C:/Users/xyz/Downloads/Apache spark installation and books/data.csv");

Now, based on the values of two columns (in the CSV now), I need to populate a third column.

In the earlier system, we would query a table and, based on these two values, retrieve the third column of that table.

The value of that column was then used to populate the main table.

Now I have the main table in CSV format, but I am not sure how I should store the data of that other table, from which I need to fetch a value based on two columns of the main table.
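
For example, something like this is what I have in mind, assuming the other table could also be exported to a CSV with headers (the lookup.csv path is just a placeholder):

    // "lookup.csv" is a placeholder path; assumes the lookup table is
    // exported to a header-equipped CSV next to the main table
    Dataset<Row> lookupDf = sparkSession.read()
            .format("csv")
            .option("header", true)
            .option("inferSchema", true)
            .load("C:/Users/xyz/Downloads/lookup.csv");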

Can you help me with this?

EDIT

More information:

In my current system I have two tables, A and B:

Table A

    col1  col2  col3
    data1 data2 data3

Table B

    col1  col2  col3  col4  col5  col6  ... coln
    data1 data2 data3 data4 data5 data6 ... datan

Currently, what happens is:

col2 and col3 of Table A are also present in Table B (as col8 and col9).

col1 of Table A is also present in Table B, but with empty values.

So the col2 and col3 values, which appear in Table B as col8 and col9, are used to look up the matching row in Table A and populate that empty column in Table B with the value of col1 from Table A.

To perform this in Spark using Java, I have created two CSV files, one for each table (is this approach correct?), and loaded them into DataFrames.

Now I am not sure how to perform the above operation and update the DataFrame containing Table B.

I hope this clarifies.
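
A minimal sketch of the intended operation as a DataFrame join, assuming header-equipped CSVs and that the matching columns in Table B are literally named col8 and col9 (the paths, the local[*] master, and the class name are assumptions, not taken from the real system):

    import static org.apache.spark.sql.functions.col;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class PopulateCol1FromLookup {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("poc-populate-col1")
                    .master("local[*]")        // assumption: local POC run
                    .getOrCreate();

            // Hypothetical file names for the two exported tables
            Dataset<Row> dfA = spark.read().format("csv")
                    .option("header", true).load("tableA.csv");
            Dataset<Row> dfB = spark.read().format("csv")
                    .option("header", true).load("tableB.csv");

            // Alias Table A's columns so they don't collide with Table B's
            Dataset<Row> lookup = dfA.select(
                    col("col1").alias("a_col1"),
                    col("col2").alias("a_col2"),
                    col("col3").alias("a_col3"));

            // Left join keeps every row of B, even rows with no match in A
            Dataset<Row> updatedB = dfB
                    .join(lookup,
                            dfB.col("col8").equalTo(lookup.col("a_col2"))
                                    .and(dfB.col("col9").equalTo(lookup.col("a_col3"))),
                            "left_outer")
                    .withColumn("col1", col("a_col1")) // replaces B's empty col1
                    .drop("a_col1", "a_col2", "a_col3");

            updatedB.show();
        }
    }

Note that DataFrames are immutable, so this does not modify dfB in place; it produces a new Dataset, which could then be written back out with something like updatedB.write().option("header", true).csv("out").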

Comments:

  • I think that you might want to illustrate what you are actually trying to do, i.e. a small data sample and the logic to create that new column. (Apr 20, 2018 at 6:38)
  • @eliasah I tried to elaborate, but I can't share the actual data as it is production data; sorry for the ambiguity. (Apr 20, 2018 at 6:50)
  • It's still unclear to me. What is the final value expected, and in which columns? (Apr 20, 2018 at 6:54)
  • Create dummy data, obfuscate stuff. You can't get help if you don't give us more info. (Apr 20, 2018 at 10:03)
  • @eliasah You are right. I have mentioned the CSV file structure in this question: stackoverflow.com/questions/49936260/… (Apr 20, 2018 at 10:29)

1 Answer


You can use the JDBC driver to query the table data and load it into a Dataset, and then use join operations on the Datasets to manipulate the data based on your requirement:

    // ConnectionString, driver, Query, username and password are
    // placeholders for your database connection details
    Dataset<Row> jdbcDF = SparkSessionBuilder.instance().getSparkSession().read().format("jdbc")
            .option("url", ConnectionString)
            .option("driver", driver)
            .option("dbtable", Query)
            .option("user", username)
            .option("password", password)
            .load();
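
As an aside, the dbtable option accepts either a table name or a parenthesized subquery with an alias, so the lookup columns can be selected directly at load time. The values below are hypothetical, for illustration only:

    // Hypothetical connection details for a MySQL database
    String ConnectionString = "jdbc:mysql://localhost:3306/mydb";
    String driver = "com.mysql.cj.jdbc.Driver";
    String username = "user";
    String password = "secret";
    // dbtable also accepts a parenthesized subquery with an alias
    String Query = "(SELECT col1, col2, col3 FROM tableA) AS a";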

Comments:

  • But what I meant is: my current project uses tables; if I want to implement the current functionality using Spark, do I still need to use tables?
  • No, it's up to you. Either you can use tables and load them into Datasets to perform operations, or you can simply use HDFS files to load the data into DataFrames and continue your processing.
  • Currently I am doing the POC on a Windows machine, so I have created two CSV files for the two tables and loaded them into different DataFrames. Now I need to check if two column values in df1 match the two column values in df2, and then update a third column in df2. How can that be done? And is this approach correct?
