
I developed an Azure Data Factory pipeline/data flow to compare source data with the existing data in a delta table; it matched 150 million records. The data flow can't handle updating that huge volume of rows, and the pipeline errors out with the message below.


Job failed due to reason: at Sink 'updateDeltaTable': Failed to execute dataflow with internal server error, please retry later. If issue persists, please contact Microsoft support for further assistance.

Operation on target ForLoopControlTable failed: Activity failed because an inner activity failed; Inner activity name: DFCallSoftDelete, Error: {"StatusCode":"DF-Executor-InternalServerError","Message":"Job failed due to reason: at Sink 'updateDeltaTable': Failed to execute dataflow with internal server error, please retry later. If issue persists, please contact Microsoft support for further assistance","Details":"org.apache.spark.SparkException: Job aborted due to stage failure: Task 179 in stage 11.0 failed 1 times, most recent failure: Lost task 179.0 in stage 11.0 (TID 1910) (vm-67d14591 executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Container from a bad node: container_1698779096775_0001_01_000004 on host: vm-67d14591. Exit status: 143. Diagnostics: [2023-10-31 19:11:37.455]Container killed on request. Exit code is 143\n[2023-10-31 19:11:37.457]Container exited with a non-zero exit code 143. \n[2023-10-31 19:11:37.464]Killed by external signal\n.\nDriver stacktrace:\n\tat org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2313)\n\tat org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2262)\n\tat org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2261)\n\tat scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)\n\tat scala.collection.mutable.ResizableArray.for"}

I have already searched online; people say the compute nodes need to be increased.

FYI: this is a historical comparison, so only the first run has this huge volume; subsequent runs will be much smaller.

Is there a way we can update the records in chunks, e.g. partitioned by year, key, or something else, to achieve this?

Can you please suggest the best way to implement this?

DataFlow diagram

1 Answer


Job failed due to reason: at Sink 'updateDeltaTable': Failed to execute dataflow with internal server error, please retry later. If issue persists, please contact Microsoft support for further assistance.

The error might be caused by the cluster running out of disk space, since you have a large amount of data to process and you are using an integration runtime with a low compute size.

For this out-of-memory error, the possible resolution is to retry using an integration runtime with a bigger core count and/or the memory-optimized compute type.

Alternatively, if you want to process the data in chunks, you first need to store it in chunks somewhere; then you can use that chunked data for further processing.

  • You can use a copy activity to store the data in chunks as CSV or Parquet files.
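
For reference, the chunking that the copy activity performs can be pictured with a PySpark sketch like the one below; the storage paths, the `row_key` column and the chunk count are illustrative assumptions, not details from the question:

```python
# Rough sketch of the "store the data in chunks" step, assuming the source can
# be read into a DataFrame and `row_key` is the business key (placeholders).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read the source data that needs to be compared against the delta table.
src = spark.read.parquet("abfss://staging@youraccount.dfs.core.windows.net/source/")

num_chunks = 200  # tune so each chunk stays well within what the data flow can handle

# Assign every row to a chunk by hashing its key, then write one folder per chunk.
chunked = src.withColumn("chunk_id", F.abs(F.hash("row_key")) % F.lit(num_chunks))

(chunked.write
    .mode("overwrite")
    .partitionBy("chunk_id")
    .parquet("abfss://staging@youraccount.dfs.core.windows.net/source_chunks/"))
```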

After this, by looping over these chunked files, you can update the existing data in the delta table with a data flow.
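
A hedged sketch of that per-chunk update, written with the Delta Lake Python API rather than a data flow (the table path, the `row_key` join key and the `is_deleted` column are assumptions for illustration):

```python
# Loop over the chunk folders written above and merge each one into the
# existing delta table. Paths, `row_key` and `is_deleted` are placeholders.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

target = DeltaTable.forPath(
    spark, "abfss://lake@youraccount.dfs.core.windows.net/delta/target_table"
)

num_chunks = 200
for chunk_id in range(num_chunks):
    chunk = spark.read.parquet(
        f"abfss://staging@youraccount.dfs.core.windows.net/source_chunks/chunk_id={chunk_id}/"
    )
    # Update only the rows matched by this chunk; here the update just sets a
    # soft-delete flag, mirroring what the DFCallSoftDelete data flow appears to do.
    (target.alias("t")
        .merge(chunk.alias("s"), "t.row_key = s.row_key")
        .whenMatchedUpdate(set={"is_deleted": "s.is_deleted"})
        .execute())
```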


2 Comments

Thanks Pratik. Do I need to store 10,000 rows per file and then loop through file by file to update the existing target delta table?
Yes, the number of rows per file is up to you; you can put as many records in a single file as you want.
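
As a quick sizing check on those numbers (150 million matched rows, 10,000 rows per file, as discussed above):

```python
total_rows = 150_000_000   # matched rows from the question
rows_per_file = 10_000     # size suggested in the comment
print(total_rows // rows_per_file)  # 15000 files, i.e. 15000 loop iterations
# Larger files (e.g. a few million rows each) would mean far fewer iterations.
```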
