I am new to Apache Spark. Could someone walk me through an example explaining how data is loaded in a Spark application running in cluster mode? To be precise: when you start an application that is responsible for loading data from a DB (with millions of records), is the entire dataset loaded first into the driver program, or is the read actually distributed to the executors so that each executor loads its own share of the data?
1 Answer
The driver coordinates the workers and the overall execution of tasks: it splits a Spark application into tasks and schedules them to run on the executors.
Say we want to load data from a datastore that has a SQL engine. That work can be distributed across executors by using the Spark JDBC read options that suit our needs.
You can take a look at those read methods here: https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.DataFrameReader
With that, say we decide to use 10 tasks that read the DB data in parallel across the workers; we can express that in the Spark program.
Assume the table has 1000 records (just as an example) and we want to read it in parallel over the column named "ID", whose values run from 1 to 1000.
Building a read like the one below and calling an action on the result will read the data from the DB:
    val resultDf = spark.read.format("jdbc")
      .option("url", connectionUrl)
      .option("dbtable", "(select * from table) as t") // a subquery needs an alias on most databases
      .option("user", devUserName)
      .option("password", devPassword)
      .option("numPartitions", 10)
      .option("partitionColumn", "ID")
      .option("lowerBound", 1)
      .option("upperBound", 1000)
      .load()
It constructs queries something like the ones below, one per task, so the tasks hit the datastore in parallel (assuming we have enough resources, i.e. cores, for this Spark job) to fetch the data, and the resultDf DataFrame is constructed from the results:
    task 1  : select * from table where ID <= 100
    task 2  : select * from table where ID > 100 AND ID <= 200
    task 3  : select * from table where ID > 200 AND ID <= 300
    ....
    ....
    task 10 : select * from table where ID > 900 AND ID <= 1000
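To see where those per-task predicates come from, here is a rough sketch of the stride logic Spark applies to lowerBound, upperBound, and numPartitions. This is a simplified illustration, not Spark's exact internal code (the real version lives in JDBCRelation and also routes NULLs to the first partition and leaves the last partition open-ended, as sketched here):

```scala
// Simplified sketch of how per-partition WHERE clauses can be derived
// from (lowerBound, upperBound, numPartitions). Illustrative only.
object PartitionSketch {
  def partitionPredicates(column: String,
                          lowerBound: Long,
                          upperBound: Long,
                          numPartitions: Int): Seq[String] = {
    val stride = (upperBound - lowerBound) / numPartitions
    (0 until numPartitions).map { i =>
      val lo = lowerBound + i * stride
      val hi = lo + stride
      if (i == 0)
        s"$column < $hi OR $column IS NULL"      // first partition also picks up NULLs
      else if (i == numPartitions - 1)
        s"$column >= $lo"                        // last partition is open-ended
      else
        s"$column >= $lo AND $column < $hi"
    }
  }
}
```

Note that lowerBound and upperBound only shape these predicates; rows outside the bounds are still read (by the first and last partitions), just unevenly.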
It's up to us to settle on the right strategy: picking the column to partition on (partitionColumn) and the number of partitions (numPartitions) we need.
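One common way to pick sensible bounds is to query the actual min/max of the partition column first, then feed them into the partitioned read. A sketch, reusing the connection variables from the example above (table and column names are illustrative):

```scala
// Sketch: derive lowerBound/upperBound from the table itself before the
// partitioned read. Assumes a SparkSession `spark` and the same connection
// variables (connectionUrl, devUserName, devPassword) as above.
val bounds = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("dbtable", "(select min(ID) as lo, max(ID) as hi from table) as b")
  .option("user", devUserName)
  .option("password", devPassword)
  .load()
  .head()

val (lower, upper) = (bounds.getLong(0), bounds.getLong(1))
```

This single-row bounds query runs as one task; only the main read afterwards is split across executors.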