
I am new to Apache Spark. Could someone please walk me through an example explaining how data is loaded in a Spark application running in cluster mode? To be precise: when you start an application that is responsible for loading data from a DB (with millions of records), is the entire dataset loaded first in the driver program, or is the function actually passed to the executors so that each executor loads its own share of the data?

1 Answer


The driver coordinates the workers and the overall execution of tasks: it splits a Spark application into tasks and schedules them to run on executors.

Say we want to load data from a datastore that has a SQL engine. That work can be distributed across executors by using Spark's JDBC read method in whichever form suits our need.

You can take a look at the available read options here: https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.DataFrameReader

With that, say we decide to use 10 tasks to read the DB data in parallel across the workers; we can express that in the Spark program.

Assume the table has 1000 records (just as an example) and we want to read it in parallel, partitioned over a column named "ID" whose values run from 1 to 1000.

Building a read like the one below and then calling an action will read the data from the DB:

val resultDf = spark.read.format("jdbc")
                         .option("url", connectionUrl)
                         .option("dbtable", "(select * from table) as t") // a subquery needs an alias on most databases
                         .option("user", devUserName)
                         .option("password", devPassword)
                         .option("numPartitions", 10)     // number of parallel read tasks
                         .option("partitionColumn", "ID") // numeric column to split the reads on
                         .option("lowerBound", 1)         // minimum value of ID
                         .option("upperBound", 1000)      // maximum value of ID
                         .load()

Under the hood, Spark constructs queries like the ones below, one per task, so that the tasks work on the datastore in parallel (assuming we have enough resources (cores) for this Spark job) to fetch the data, and the resultDf DataFrame is constructed from their results:

task 1 : select * from table where ID <= 100
task 2 : select * from table where ID > 100 AND ID <= 200
task 3 : select * from table where ID > 200 AND ID <= 300
....
....
task 10: select * from table where ID > 900 AND ID <= 1000
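To make the splitting concrete, here is a small hedged sketch in plain Scala of how a JDBC source can derive one WHERE clause per partition from lowerBound, upperBound and numPartitions. It approximates (rather than reproduces) Spark's internal logic; the function name partitionWhereClauses is illustrative, not a Spark API:

```scala
// Sketch: derive one WHERE clause per partition, roughly the way a
// JDBC data source splits a numeric column into parallel reads.
// Assumes numPartitions > 1; real Spark also special-cases 1 partition.
def partitionWhereClauses(column: String,
                          lowerBound: Long,
                          upperBound: Long,
                          numPartitions: Int): Seq[String] = {
  // Integer stride between consecutive partition boundaries.
  val stride = upperBound / numPartitions - lowerBound / numPartitions
  (0 until numPartitions).map { i =>
    val lo = lowerBound + i * stride
    val hi = lo + stride
    if (i == 0)
      s"$column < $hi OR $column IS NULL"        // first slice also catches NULLs
    else if (i == numPartitions - 1)
      s"$column >= $lo"                          // last slice is open-ended
    else
      s"$column >= $lo AND $column < $hi"
  }
}

partitionWhereClauses("ID", 1, 1000, 10).foreach(println)
```

Each resulting string is appended to the base query of one task, which is why out-of-range or NULL IDs must still fall into some slice, or rows would silently be dropped.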

It's up to us to settle on the strategy: choosing the right column to partition on (partitionColumn) and the number of partitions (numPartitions) we need. The partition column should be numeric with fairly evenly spread values; otherwise some tasks end up reading most of the data while others sit idle.
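In practice you often don't know the bounds up front. One common approach, sketched below, is to run a small unpartitioned JDBC read for min/max of the partition column first and feed the result into the partitioned read. This is an untested sketch that assumes a live SparkSession and database, reusing the connectionUrl, devUserName and devPassword values from the answer above:

```scala
// Sketch: fetch min/max of the partition column, then use them as bounds.
// Requires a running SparkSession (`spark`) and a reachable database.
val bounds = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("dbtable", "(select min(ID) as lo, max(ID) as hi from table) as b")
  .option("user", devUserName)
  .option("password", devPassword)
  .load()
  .first()                      // single-row result, read by one task

val lowerBound = bounds.getAs[Number]("lo").longValue
val upperBound = bounds.getAs[Number]("hi").longValue
// Pass lowerBound/upperBound into the partitioned read shown earlier.
```

The bounds query itself is tiny, so running it unpartitioned on one task is cheap compared to the full table scan it enables you to parallelize.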


2 Comments

Thank you for the detailed explanation. So, if I understand correctly, the driver won't load the data into memory; instead it decides the strategy and passes it to the executors, which load the data and act on it.
True. The executors are the ones doing the work there, and if you cache, the read data is persisted in the corresponding executors' memory. Data is transferred to the driver only when you call a method like collect(); in that case you are asking for all the results to be returned to the driver node.
