
I want to write around 10 GB of data every day to an Azure SQL Server database using PySpark. I am currently using the JDBC driver, which takes hours because it issues insert statements one by one.
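For context, the current write looks roughly like the sketch below (placeholders only: the server, database, table, and credentials match the config further down, and an existing DataFrame df is assumed; the batchsize value is just an example):

jdbc_url = (
    "jdbc:sqlserver://mysqlserver.database.windows.net:1433;"
    "database=MyDatabase;encrypt=true;loginTimeout=30"
)

(df.write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.Clients")
    .option("user", "username")
    .option("password", "*********")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("batchsize", 10000)  # larger JDBC batches help a little, but these are still plain INSERTs
    .mode("append")
    .save())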

I am planning to use the azure-sqldb-spark connector, which claims to turbo-boost the write using bulk insert.

I went through the official doc: https://github.com/Azure/azure-sqldb-spark. The library is written in Scala and basically requires the use of two Scala classes:

import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

val bulkCopyConfig = Config(Map(
  "url"               -> "mysqlserver.database.windows.net",
  "databaseName"      -> "MyDatabase",
  "user"              -> "username",
  "password"          -> "*********",
  "dbTable"           -> "dbo.Clients",
  "bulkCopyBatchSize" -> "2500",
  "bulkCopyTableLock" -> "true",
  "bulkCopyTimeout"   -> "600"
))

df.bulkCopyToSqlDB(bulkCopyConfig)

Can it be implemented in PySpark like this (using sc._jvm):

Config = sc._jvm.com.microsoft.azure.sqldb.spark.config.Config
connect = sc._jvm.com.microsoft.azure.sqldb.spark.connect._

# all config

df.connect.bulkCopyToSqlDB(bulkCopyConfig)

I am not an expert in Python. Can anybody help me with a complete snippet to get this done?

5 Comments
  • What help are you expecting? Commented Oct 27, 2018 at 10:51
  • How to use the azure-sqldb-spark connector in PySpark? I know it can be done in Scala, but my entire project is in Python. Commented Oct 29, 2018 at 7:36
  • I think we don't have any examples yet; please subscribe to this issue: github.com/Azure/azure-sqldb-spark/issues/20 Commented Oct 30, 2018 at 6:26
  • Hey @AjayKumar, how did you overcome the performance issue in PySpark? I am currently running into a performance issue. Can you help me? Commented Sep 18, 2019 at 12:46
  • @AjayKumar The project in the GitHub link you referenced is no longer actively maintained. Instead, use the project in this link. Microsoft encourages us to use that project, which has Python and R bindings, an easier-to-use interface for bulk inserting data, and many other improvements (a usage sketch follows after these comments). Commented Apr 23, 2022 at 17:05
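
For reference, here is a minimal, hedged sketch of what a bulk write typically looks like with the newer connector mentioned in the last comment above (the Apache Spark connector for SQL Server and Azure SQL, Spark format com.microsoft.sqlserver.jdbc.spark). This is illustrative, not code from the original post: it assumes the connector jar is installed on the cluster and reuses the placeholder server, database, table, and credentials from the question:

server_url = "jdbc:sqlserver://mysqlserver.database.windows.net:1433;database=MyDatabase"

(df.write
    .format("com.microsoft.sqlserver.jdbc.spark")
    .mode("append")
    .option("url", server_url)
    .option("dbtable", "dbo.Clients")
    .option("user", "username")
    .option("password", "*********")
    .option("batchsize", "2500")  # plays the role of bulkCopyBatchSize above
    .option("tableLock", "true")  # plays the role of bulkCopyTableLock above
    .save())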

1 Answer


The Spark connector currently (as of March 2019) only supports the Scala API (as documented here). So if you are working in a notebook, you could do all the preprocessing in Python and finally register the dataframe as a temp table, e.g.:

df.createOrReplaceTempView('testbulk')

and then do the final step in Scala:

%scala
//configs...
spark.table("testbulk").bulkCopyToSqlDB(bulkCopyConfig)

4 Comments

This works well. Until the connector is implemented for PySpark, this workaround should do the job.
@huichen do you know how to add LDAP authorization in?
You mean add LDAP auth to the cluster? You can try adding it in the init script, so that every time the cluster is started it will be installed.
@huichen can you elaborate on this, please?
