
I want to write around 10 GB of data every day to an Azure SQL Server database using PySpark. I am currently using the JDBC driver, which takes hours because it issues insert statements one by one.
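For context, the current write looks roughly like the sketch below (placeholders only: the server, database, table, and credentials match the config further down, and an existing DataFrame df is assumed; the batchsize value is just an example):

jdbc_url = (
    "jdbc:sqlserver://mysqlserver.database.windows.net:1433;"
    "database=MyDatabase;encrypt=true;loginTimeout=30"
)

(df.write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.Clients")
    .option("user", "username")
    .option("password", "*********")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("batchsize", 10000)  # larger JDBC batches help a little, but these are still plain INSERTs
    .mode("append")
    .save())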

I am planning to use the azure-sqldb-spark connector, which claims to turbo-boost the write using bulk insert.

I went through the official doc: https://github.com/Azure/azure-sqldb-spark. The library is written in Scala and basically requires the use of two Scala classes:

import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

val bulkCopyConfig = Config(Map(
  "url"               -> "mysqlserver.database.windows.net",
  "databaseName"      -> "MyDatabase",
  "user"              -> "username",
  "password"          -> "*********",
  "dbTable"           -> "dbo.Clients",
  "bulkCopyBatchSize" -> "2500",
  "bulkCopyTableLock" -> "true",
  "bulkCopyTimeout"   -> "600"
))

df.bulkCopyToSqlDB(bulkCopyConfig)

Can it be implemented in PySpark like this (using sc._jvm):

Config = sc._jvm.com.microsoft.azure.sqldb.spark.config.Config
connect = sc._jvm.com.microsoft.azure.sqldb.spark.connect._

# all config

df.connect.bulkCopyToSqlDB(bulkCopyConfig)

I am not an expert in Python. Can anybody help me with a complete snippet to get this done?

5 Comments
  • What help are you expecting? Commented Oct 27, 2018 at 10:51
  • How to use the azure-sqldb-spark connector in PySpark? I know it can be done in Scala, but my entire project is in Python. Commented Oct 29, 2018 at 7:36
  • I think we don't have any examples yet; please subscribe to this issue: github.com/Azure/azure-sqldb-spark/issues/20 Commented Oct 30, 2018 at 6:26
  • Hey @AjayKumar, how did you overcome the performance issue in PySpark? I am currently running into a performance issue. Can you help me? Commented Sep 18, 2019 at 12:46
  • @AjayKumar The project in the GitHub link you referenced is no longer actively maintained. Instead, use the project in this link. Microsoft encourages us to use that project, which has Python and R bindings, an easier-to-use interface for bulk inserting data, and many other improvements (a usage sketch follows after these comments). Commented Apr 23, 2022 at 17:05
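
For reference, here is a minimal, hedged sketch of what a bulk write typically looks like with the newer connector mentioned in the last comment above (the Apache Spark connector for SQL Server and Azure SQL, Spark format com.microsoft.sqlserver.jdbc.spark). This is illustrative, not code from the original post: it assumes the connector jar is installed on the cluster and reuses the placeholder server, database, table, and credentials from the question:

server_url = "jdbc:sqlserver://mysqlserver.database.windows.net:1433;database=MyDatabase"

(df.write
    .format("com.microsoft.sqlserver.jdbc.spark")
    .mode("append")
    .option("url", server_url)
    .option("dbtable", "dbo.Clients")
    .option("user", "username")
    .option("password", "*********")
    .option("batchsize", "2500")  # plays the role of bulkCopyBatchSize above
    .option("tableLock", "true")  # plays the role of bulkCopyTableLock above
    .save())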

1 Answer


The Spark connector currently (as of March 2019) only supports the Scala API (as documented here). So if you are working in a notebook, you could do all the preprocessing in Python and finally register the dataframe as a temp table, e.g.:

df.createOrReplaceTempView('testbulk')

and then do the final step in Scala:

%scala
//configs...
spark.table("testbulk").bulkCopyToSqlDB(bulkCopyConfig)

4 Comments

This works well. Until the connector is implemented for PySpark, this workaround should do the job.
@huichen do you know how to add LDAP authorization in?
You mean add LDAP auth to the cluster? You can try adding it in the init script, so that every time the cluster is started it will be installed.
@huichen can you elaborate on this, please?
