
Consider a Spark JDBC DataFrame reading from an RDBMS table, as given below:

    val df = spark.read.format("jdbc")
      .option("url", url)
      .option("dbtable", "schema.table")
      .option("user", user)
      .option("password", password)
      .load()

    df.count

This count action is not recommended, since it loads the data into the Spark layer and computes the count there instead of pushing the count query down to the JDBC data source. What is the efficient way to get the count in this scenario?

  • Point is, do you need these rows from that table subsequently? Commented Apr 2, 2020 at 8:47

1 Answer


Typically count will only be used once in your business logic (this is just an assumption), so the recommended way is to open a standard JDBC connection and execute an SQL statement that counts the rows. That way the query runs directly in the database rather than in Spark. Something like this might help you:

    // Plain JDBC: count(*) executes in the database; only one row comes back
    val query = "select count(*) from schema.table"
    val connection = getMySqlConnection(...)  // connection helper (details elided)
    val rs = connection.createStatement().executeQuery(query)
    rs.next()                                 // advance to the single result row
    val count = rs.getLong(1)
    connection.close()
    count
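
If you would rather stay in the Spark API, the JDBC source can also be handed the count query itself, so the aggregation still runs in the database and Spark only reads back the single result row. A minimal sketch, assuming Spark 2.4+ (which added the query option) and the same url, user and password values as in the question:

    // The database computes count(*); Spark merely loads the one-row result
    val countDf = spark.read.format("jdbc")
      .option("url", url)
      .option("query", "select count(*) as cnt from schema.table")
      .option("user", user)
      .option("password", password)
      .load()
    // The column type of count(*) depends on the database (LongType for MySQL)
    val count = countDf.first().getLong(0)

On Spark versions before 2.4 the same effect can be achieved by passing a parenthesised subquery through the dbtable option, e.g. "(select count(*) from schema.table) t".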

Comments

Thank you dumitru and @thebluephantom. My intention was to see if there is any workaround.
