I am designing a data pipeline that consumes data from Salesforce using the Bulk API (pull mechanism).
The raw data lands in an ADLS Gen2 Bronze layer.
A Databricks transformation job then cleans the data and writes the clean records to the Silver layer in ADLS Gen2.
From the Silver layer, Databricks pushes the clean records to another Databricks environment (instance B).
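For the hourly incremental pull, my current idea is to filter the Bulk API query on a watermark column. A minimal sketch, assuming `SystemModstamp` as the watermark column and `Account` as the object (both illustrative, not final):

```python
from datetime import datetime, timezone
from typing import Optional

def build_incremental_soql(sobject: str, watermark: Optional[datetime]) -> str:
    """Build the SOQL for a Salesforce Bulk API query job.

    A missing watermark means the one-time full load; otherwise only
    records modified since the last successful run are pulled.
    """
    base = f"SELECT Id, Name, SystemModstamp FROM {sobject}"
    if watermark is None:
        return base  # initial full extract
    # Salesforce expects ISO-8601 UTC datetime literals, unquoted
    ts = watermark.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{base} WHERE SystemModstamp > {ts}"

# Hourly run: the watermark is the timestamp saved from the previous run
last_run = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
print(build_incremental_soql("Account", last_run))
# SELECT Id, Name, SystemModstamp FROM Account WHERE SystemModstamp > 2024-05-01T12:00:00Z
```

The watermark itself would be persisted somewhere durable (e.g. a small control table or file) between runs.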
My questions are:
- How should I handle orchestration?
- I need a one-time full load, then an hourly incremental pull that ingests only records not already present. How do I detect those records?
- How do I make sure the Databricks transformation starts only once all the records have arrived?
- How do I ensure the step after processing writes the clean records to the ADLS Gen2 Silver layer?
- And lastly, how does Databricks know it has to move those records to the instance B Databricks workspace, as shown in the figure?
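On Databricks, the "only if not already present" check would typically be a Delta `MERGE` (insert-when-not-matched) on the Salesforce `Id`. The logic itself is just an anti-join on the key, sketched here in plain Python (field names are illustrative):

```python
def new_records(incoming, existing_ids, key="Id"):
    """Keep only incoming records whose key is not already in the target.

    On Databricks, `existing_ids` would be the keys in the Bronze Delta
    table, and a Delta MERGE with WHEN NOT MATCHED THEN INSERT does the
    same thing atomically and at scale.
    """
    return [r for r in incoming if r[key] not in existing_ids]

existing = {"001A", "001B"}
batch = [
    {"Id": "001A", "Name": "Acme"},    # already present, skipped
    {"Id": "001C", "Name": "Globex"},  # new, inserted
]
print(new_records(batch, existing))
# [{'Id': '001C', 'Name': 'Globex'}]
```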
Could someone please suggest how to achieve this? Which of the following options is scalable, reliable, and able to handle high throughput?
- Option #1: connect and ingest using an Azure Function, orchestrate through ADF, Bronze to Silver using Databricks
- Option #2: connect and ingest using Databricks, orchestrate through ADF, Bronze to Silver using Databricks [native Databricks Lakeflow connector to Salesforce]
- Option #3: connect and ingest using ADF, orchestrate through ADF, Bronze to Silver using Databricks [native ADF connector to Salesforce]
- Option #4: connect and ingest using Databricks, orchestrate through Databricks, Bronze to Silver using Databricks [no ADF at all]
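For Option #4 specifically, my understanding is that the chaining questions above (transform only after ingestion, push only after transform) map onto task dependencies in a single Databricks Workflows job. A sketch of such a job spec as a Python dict, following the Jobs API 2.1 `tasks`/`depends_on` shape (job name and notebook paths are made up):

```python
# Sketch of a Databricks Workflows job definition (Jobs API 2.1 style).
# Each task runs only after every task in its depends_on list succeeds,
# which is what guarantees the step ordering end to end.
job_spec = {
    "name": "salesforce_bronze_to_silver",
    "schedule": {
        "quartz_cron_expression": "0 0 * * * ?",  # hourly
        "timezone_id": "UTC",
    },
    "tasks": [
        {"task_key": "ingest_salesforce",
         "notebook_task": {"notebook_path": "/pipelines/ingest_sf"}},
        {"task_key": "bronze_to_silver",
         "depends_on": [{"task_key": "ingest_salesforce"}],
         "notebook_task": {"notebook_path": "/pipelines/transform"}},
        {"task_key": "push_to_instance_b",
         "depends_on": [{"task_key": "bronze_to_silver"}],
         "notebook_task": {"notebook_path": "/pipelines/share_out"}},
    ],
}

# The scheduler never starts a task before its dependencies finish,
# so downstream-before-upstream cannot happen.
for t in job_spec["tasks"]:
    deps = [d["task_key"] for d in t.get("depends_on", [])]
    print(t["task_key"], "<-", deps)
```

With ADF in the picture (Options #1-#3), the same ordering would instead be expressed as activity dependencies in an ADF pipeline.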
Image: Logical Flow
Thanks a lot.