
I have a large CSV file stored on Blob Storage in Azure. I want to load it into an Azure SQL database as quickly as possible.

I've tried running SSIS in the Data Factory Integration Runtime, but it is quite slow because it is a single thread/process.

What is the best way to parallelize the data load from a CSV in Azure to a SQL database?

Note: I am OK with moving the CSV to alternative storage such as Data Lake if necessary.

  • AWS has a built-in feature for this... Azure does not. You'll have to script it. Commented Dec 6, 2018 at 20:14
  • What's your expected throughput? And how big is the source file? Commented Dec 7, 2018 at 1:09
  • File is anything from 500 MB to a few GB in size, with the test file being 1 GB. I have no expected throughput; I just want it to load as quickly as possible. It's currently on cool Blob Storage and taking about 5 minutes, but I would like to get that down as much as possible. In an ideal world, if 1 thread takes 5 minutes, 10 threads should load the file in 30 seconds. I know this ideal is rarely achievable, but parallelism has to help a bit. I just don't know how to go about it for a CSV. Commented Dec 7, 2018 at 10:31
  • There is no native tool AFAIK that can read the CSV in parallel. Can you split the CSV into multiple files? If yes, each file could be loaded in parallel into the same table. Commented Dec 8, 2018 at 19:07

1 Answer


The quickest way in Azure SQL is to use a BULK operation (BULK INSERT or OPENROWSET BULK). You first need to create an EXTERNAL DATA SOURCE pointing to the Azure Blob Storage container that holds the CSV you want to import, and then you can use the BULK operation:

SELECT * FROM OPENROWSET(BULK ...)
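A minimal sketch of the setup, assuming SAS-token authentication; the credential, data source, and file names below are hypothetical placeholders:

-- Azure SQL Database needs a master key before a scoped credential can be created.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

-- Assumption: a SAS token with read/list permissions on the container (omit the leading '?').
CREATE DATABASE SCOPED CREDENTIAL BlobCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<SAS token>';

CREATE EXTERNAL DATA SOURCE CsvBlobStorage
WITH (
    TYPE = BLOB_STORAGE,
    LOCATION = 'https://<account>.blob.core.windows.net/<container>',
    CREDENTIAL = BlobCredential
);

-- OPENROWSET needs a format file describing the CSV columns.
SELECT *
FROM OPENROWSET(
    BULK 'bigfile.csv',
    DATA_SOURCE = 'CsvBlobStorage',
    FORMATFILE = 'bigfile.fmt',
    FORMATFILE_DATA_SOURCE = 'CsvBlobStorage',
    FIRSTROW = 2
) AS DataFile;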

A full explanation and sample is here:

https://medium.com/@mauridb/automatic-import-of-csv-data-using-azure-functions-and-azure-sql-63e1070963cf

The example describes how to import files dropped into a Blob Storage container. Multiple files will be imported in parallel.
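To get the same parallelism by hand, here is a hedged sketch (the file names and table are hypothetical, and it assumes your 1 GB CSV has been pre-split into parts and the target table is a heap with no indexes): run one BULK INSERT per part, each from its own connection. With TABLOCK, each session takes a bulk-update (BU) lock, and BU locks are compatible with each other, so the parts load concurrently into the same table.

-- Connection 1:
BULK INSERT dbo.TargetTable
FROM 'parts/part_00.csv'
WITH (DATA_SOURCE = 'CsvBlobStorage', FORMAT = 'CSV', FIRSTROW = 2, TABLOCK);

-- Connection 2 (repeat for the remaining parts):
BULK INSERT dbo.TargetTable
FROM 'parts/part_01.csv'
WITH (DATA_SOURCE = 'CsvBlobStorage', FORMAT = 'CSV', FIRSTROW = 2, TABLOCK);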

For a complete description of how to bulk import data from Azure Blob Storage into Azure SQL, there are plenty of samples in the official documentation:

https://learn.microsoft.com/en-us/sql/t-sql/statements/bulk-insert-transact-sql?view=sql-server-2017#f-importing-data-from-a-file-in-azure-blob-storage

Another option is to use Azure Data Factory, which will be as fast as the BULK option just mentioned. It requires the creation of an Azure Data Factory pipeline, which adds some complexity to the solution, but on the other hand it can be done without writing a single line of code.
