
I developed an Azure Data Factory pipeline/data flow to compare source data with the existing data in a delta table; it matched 150 million records. The data flow can't handle updating that huge volume of rows, and the pipeline errors out with the message below.


Job failed due to reason: at Sink 'updateDeltaTable': Failed to execute dataflow with internal server error, please retry later. If issue persists, please contact Microsoft support for further assistance.

Operation on target ForLoopControlTable failed: Activity failed because an inner activity failed; Inner activity name: DFCallSoftDelete, Error: {"StatusCode":"DF-Executor-InternalServerError","Message":"Job failed due to reason: at Sink 'updateDeltaTable': Failed to execute dataflow with internal server error, please retry later. If issue persists, please contact Microsoft support for further assistance","Details":"org.apache.spark.SparkException: Job aborted due to stage failure: Task 179 in stage 11.0 failed 1 times, most recent failure: Lost task 179.0 in stage 11.0 (TID 1910) (vm-67d14591 executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Container from a bad node: container_1698779096775_0001_01_000004 on host: vm-67d14591. Exit status: 143. Diagnostics: [2023-10-31 19:11:37.455]Container killed on request. Exit code is 143\n[2023-10-31 19:11:37.457]Container exited with a non-zero exit code 143. \n[2023-10-31 19:11:37.464]Killed by external signal\n.\nDriver stacktrace:\n\tat org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2313)\n\tat org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2262)\n\tat org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2261)\n\tat scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)\n\tat scala.collection.mutable.ResizableArray.for"}

I have already searched online; people say the compute nodes need to be increased.

FYI: this is a historical comparison, so only the first run has this huge volume; subsequent runs will be much smaller.

Is there a way we can update the records in chunks, e.g. partitioned by year, key, or something else, to achieve this?

Can you please suggest the best way to implement this?

DataFlow diagram

1 Answer


Job failed due to reason: at Sink 'updateDeltaTable': Failed to execute dataflow with internal server error, please retry later. If issue persists, please contact Microsoft support for further assistance.

The error might be caused by the cluster running out of disk space, since you have a large amount of data to process and you are using an integration runtime with a low compute size.

For this out-of-memory error, the possible resolution is to retry using an integration runtime with a bigger core count and/or the memory-optimized compute type.

Alternatively, if you want to process the data in chunks, you first need to store it in chunks somewhere; then you can use that chunked data for further processing.

  • You can use a copy activity to store the data in chunks as CSV or Parquet files.
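
For reference, the chunking that the copy activity performs can be pictured with a PySpark sketch like the one below; the storage paths, the `row_key` column and the chunk count are illustrative assumptions, not details from the question:

```python
# Rough sketch of the "store the data in chunks" step, assuming the source can
# be read into a DataFrame and `row_key` is the business key (placeholders).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read the source data that needs to be compared against the delta table.
src = spark.read.parquet("abfss://staging@youraccount.dfs.core.windows.net/source/")

num_chunks = 200  # tune so each chunk stays well within what the data flow can handle

# Assign every row to a chunk by hashing its key, then write one folder per chunk.
chunked = src.withColumn("chunk_id", F.abs(F.hash("row_key")) % F.lit(num_chunks))

(chunked.write
    .mode("overwrite")
    .partitionBy("chunk_id")
    .parquet("abfss://staging@youraccount.dfs.core.windows.net/source_chunks/"))
```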

After this, by looping over these chunked files, you can update the existing data in the delta table with a data flow.
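
A hedged sketch of that per-chunk update, written with the Delta Lake Python API rather than a data flow (the table path, the `row_key` join key and the `is_deleted` column are assumptions for illustration):

```python
# Loop over the chunk folders written above and merge each one into the
# existing delta table. Paths, `row_key` and `is_deleted` are placeholders.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

target = DeltaTable.forPath(
    spark, "abfss://lake@youraccount.dfs.core.windows.net/delta/target_table"
)

num_chunks = 200
for chunk_id in range(num_chunks):
    chunk = spark.read.parquet(
        f"abfss://staging@youraccount.dfs.core.windows.net/source_chunks/chunk_id={chunk_id}/"
    )
    # Update only the rows matched by this chunk; here the update just sets a
    # soft-delete flag, mirroring what the DFCallSoftDelete data flow appears to do.
    (target.alias("t")
        .merge(chunk.alias("s"), "t.row_key = s.row_key")
        .whenMatchedUpdate(set={"is_deleted": "s.is_deleted"})
        .execute())
```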


2 Comments

Thanks Pratik. Do I need to store 10,000 rows per file and then loop through file by file to update the existing target delta table?
Yes, the number of rows per file is up to you; you can put as many records in a single file as you want.
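
As a quick sizing check on those numbers (150 million matched rows, 10,000 rows per file, as discussed above):

```python
total_rows = 150_000_000   # matched rows from the question
rows_per_file = 10_000     # size suggested in the comment
print(total_rows // rows_per_file)  # 15000 files, i.e. 15000 loop iterations
# Larger files (e.g. a few million rows each) would mean far fewer iterations.
```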
