
I am designing a data pipeline that consumes data from Salesforce using the Bulk API endpoint (pull mechanism).

The data lands in an ADLS Gen2 Bronze layer.

Next, a transformation job starts, cleans the data, and pushes it to the Silver layer in ADLS Gen2. The transformation is performed by Databricks.

After pushing the clean records to the ADLS Gen2 Silver layer, I use Databricks to push them to another Databricks environment.

My questions are:

  • How do I handle orchestration?

    I have to pull the full data set once, then pull incremental records every hour, where a record is ingested only if it is not already present.

    How do I make sure the transformation (performed in Databricks) starts only once all the records have arrived?

  • How do I make sure the next step after processing is to push the records to the ADLS Gen2 Silver layer?

  • Lastly, how does Databricks know it has to move those records to the instance B Databricks workspace, as shown in the figure?
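To make the hourly incremental requirement concrete, here is a minimal in-memory sketch of the dedup rule described above. This is an illustration only: the record shapes, the field name `Id`, and the sample values are assumptions, not part of the actual pipeline.

```python
# Minimal sketch of the hourly incremental rule: a record is ingested only
# if its Id is not already present in the Bronze layer. The field name "Id"
# follows Salesforce conventions but is an assumption here.

def filter_new_records(incoming, existing_ids):
    """Return only the records whose Id has not been ingested yet."""
    return [rec for rec in incoming if rec["Id"] not in existing_ids]

existing_ids = {"001A", "001B"}          # Ids already landed in Bronze
batch = [
    {"Id": "001A", "Name": "Acme"},      # duplicate -> skipped
    {"Id": "001C", "Name": "Globex"},    # new -> ingested
]

new_records = filter_new_records(batch, existing_ids)
print([r["Id"] for r in new_records])    # ['001C']
```

In a real pipeline the `existing_ids` check would typically be a Delta `MERGE` or a watermark on a modification timestamp rather than an in-memory set.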

Could someone please suggest how to achieve this?

Which option is scalable, reliable, and can handle high throughput?

  • Option #1: connect and ingest using an Azure Function, orchestrated through ADF, Bronze to Silver using Databricks
  • Option #2: connect and ingest using Databricks, orchestrated through ADF, Bronze to Silver using Databricks [native Databricks connector to Salesforce, Lakeflow]
  • Option #3: connect and ingest using ADF, orchestrated through ADF, Bronze to Silver using Databricks [native ADF connector to Salesforce]
  • Option #4: connect and ingest using Databricks, orchestrated through Databricks, Bronze to Silver using Databricks [no ADF at all]

Image: Logical Flow

Thanks a lot.


1 Answer


For your scenario, the best approach is Option #1: use an Azure Function for ingestion, orchestrated end-to-end with Azure Data Factory (ADF), then transform Bronze to Silver using Databricks.

  • ADF handles the full orchestration: triggering the Azure Function, checking file arrival in the ADLS Bronze layer, and kicking off Databricks jobs only when all data is ready.

  • Inside Databricks, use a simple control table to log which files have been processed; this ensures transformations only run on complete data.

  • When the transformation finishes, let ADF move the cleaned data forward or trigger the next Databricks job; this chaining is reliable, scalable, and easy to monitor.

This pattern keeps ingestion, transformation, and orchestration loosely coupled but fully automated, which is ideal for high-throughput pipelines.
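The control-table idea above can be sketched as follows. This is a minimal in-memory illustration under stated assumptions: in a real pipeline the table would be a Delta table queried by ADF or by the Databricks job itself, and the file names and statuses here are hypothetical.

```python
# Minimal sketch of a control table that gates the Bronze-to-Silver job:
# the transformation runs only when every expected file is marked "landed".
# The file names, the "landed" status value, and the dict-based table are
# all hypothetical stand-ins for a real Delta control table.

control_table = {}  # file name -> status

def mark_landed(file_name):
    """Record that a file has arrived in the Bronze layer."""
    control_table[file_name] = "landed"

def ready_to_transform(expected_files):
    """True once every expected file has landed."""
    return all(control_table.get(f) == "landed" for f in expected_files)

expected = ["accounts_2025_07_15.json", "contacts_2025_07_15.json"]

mark_landed("accounts_2025_07_15.json")
print(ready_to_transform(expected))   # False: contacts file still missing

mark_landed("contacts_2025_07_15.json")
print(ready_to_transform(expected))   # True: transformation can start
```

The same gate can be evaluated from an ADF Lookup activity followed by an If Condition, so the Databricks job is only triggered when the check passes.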



2 Comments

Out of the 4 options listed in the question, which one should I prefer, and what are the pros and cons? Please help.
Option #1 is the strongest for most cases: Azure Function + ADF gives you clean orchestration, flexible retry logic, easy monitoring, and full separation of concerns. Databricks focuses only on the heavy processing, which keeps your pipeline modular and easy to scale.
