
Consider a Spark JDBC DataFrame reading from an RDBMS table, as given below:

    val df = spark.read.format("jdbc")
      .option("url", url)
      .option("dbtable", "schema.table")
      .option("user", user)
      .option("password", password)
      .load()

    df.count

This count action is not recommended, since it loads the data into the Spark layer and computes the count there instead of pushing the count query down to the JDBC data source. What is the efficient way to get the count in this scenario?

  • Point is, do you need these rows from that table subsequently? Commented Apr 2, 2020 at 8:47

1 Answer


Typically count will only be used once in your business logic (this is just an assumption), so the recommended way is to open a standard JDBC connection and execute an SQL statement that counts the rows. That way the query runs directly in the database rather than in Spark. Something like this might help you:

    // Plain JDBC: count(*) executes in the database; only one row comes back
    val query = "select count(*) from schema.table"
    val connection = getMySqlConnection(...)  // connection helper (details elided)
    val rs = connection.createStatement().executeQuery(query)
    rs.next()                                 // advance to the single result row
    val count = rs.getLong(1)
    connection.close()
    count
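
If you would rather stay in the Spark API, the JDBC source can also be handed the count query itself, so the aggregation still runs in the database and Spark only reads back the single result row. A minimal sketch, assuming Spark 2.4+ (which added the query option) and the same url, user and password values as in the question:

    // The database computes count(*); Spark merely loads the one-row result
    val countDf = spark.read.format("jdbc")
      .option("url", url)
      .option("query", "select count(*) as cnt from schema.table")
      .option("user", user)
      .option("password", password)
      .load()
    // The column type of count(*) depends on the database (LongType for MySQL)
    val count = countDf.first().getLong(0)

On Spark versions before 2.4 the same effect can be achieved by passing a parenthesised subquery through the dbtable option, e.g. "(select count(*) from schema.table) t".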

Comments

Thank you dumitru and @thebluephantom. My intention was to see if there is any workaround.
