I am new to Apache Spark. Could someone walk me through an example explaining how data is loaded in a Spark application running in cluster mode? To be precise: when you start an application that is responsible for loading data from a DB (with millions of records), is the entire dataset loaded first into the driver program, or is the read actually distributed to the executors so that each executor loads its own share of the data?
1 Answer
The driver coordinates the workers and the overall execution of tasks: it splits a Spark application into tasks and schedules them to run on the executors.
Say we want to load data from a datastore that has a SQL engine. That work can be distributed across executors by using the Spark JDBC read options that suit our needs.
You can take a look at those read methods here: https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.DataFrameReader
With that, say we decide to use 10 tasks that read the DB data in parallel across the workers; we can express that in the Spark program.
Assume the table has 1000 records (just as an example) and we want to read it in parallel over the column named "ID", whose values run from 1 to 1000.
Building a read like the one below and calling an action on the result will read the data from the DB:
    val resultDf = spark.read.format("jdbc")
      .option("url", connectionUrl)
      .option("dbtable", "(select * from table) as t") // a subquery needs an alias on most databases
      .option("user", devUserName)
      .option("password", devPassword)
      .option("numPartitions", 10)
      .option("partitionColumn", "ID")
      .option("lowerBound", 1)
      .option("upperBound", 1000)
      .load()
It constructs queries something like the ones below, one per task, so the tasks hit the datastore in parallel (assuming we have enough resources, i.e. cores, for this Spark job) to fetch the data, and the resultDf DataFrame is constructed from the results:
    task 1  : select * from table where ID <= 100
    task 2  : select * from table where ID > 100 AND ID <= 200
    task 3  : select * from table where ID > 200 AND ID <= 300
    ....
    ....
    task 10 : select * from table where ID > 900 AND ID <= 1000
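To see where those per-task predicates come from, here is a rough sketch of the stride logic Spark applies to lowerBound, upperBound, and numPartitions. This is a simplified illustration, not Spark's exact internal code (the real version lives in JDBCRelation and also routes NULLs to the first partition and leaves the last partition open-ended, as sketched here):

```scala
// Simplified sketch of how per-partition WHERE clauses can be derived
// from (lowerBound, upperBound, numPartitions). Illustrative only.
object PartitionSketch {
  def partitionPredicates(column: String,
                          lowerBound: Long,
                          upperBound: Long,
                          numPartitions: Int): Seq[String] = {
    val stride = (upperBound - lowerBound) / numPartitions
    (0 until numPartitions).map { i =>
      val lo = lowerBound + i * stride
      val hi = lo + stride
      if (i == 0)
        s"$column < $hi OR $column IS NULL"      // first partition also picks up NULLs
      else if (i == numPartitions - 1)
        s"$column >= $lo"                        // last partition is open-ended
      else
        s"$column >= $lo AND $column < $hi"
    }
  }
}
```

Note that lowerBound and upperBound only shape these predicates; rows outside the bounds are still read (by the first and last partitions), just unevenly.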
It's up to us to settle on the right strategy: picking the column to partition on (partitionColumn) and the number of partitions (numPartitions) we need.
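One common way to pick sensible bounds is to query the actual min/max of the partition column first, then feed them into the partitioned read. A sketch, reusing the connection variables from the example above (table and column names are illustrative):

```scala
// Sketch: derive lowerBound/upperBound from the table itself before the
// partitioned read. Assumes a SparkSession `spark` and the same connection
// variables (connectionUrl, devUserName, devPassword) as above.
val bounds = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("dbtable", "(select min(ID) as lo, max(ID) as hi from table) as b")
  .option("user", devUserName)
  .option("password", devPassword)
  .load()
  .head()

val (lower, upper) = (bounds.getLong(0), bounds.getLong(1))
```

This single-row bounds query runs as one task; only the main read afterwards is split across executors.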