I have multiple Kafka topics that I need to sink into their respective Delta tables.
A) 1 Streaming query for all topics
If I use one streaming query, the resulting DataFrame contains data from multiple topics. I could filter the DataFrame after the read from Kafka, i.e. create one DataFrame per topic, and then write each DataFrame separately to its corresponding table.
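Here is roughly what I mean, as a sketch. The topic names, broker address, and Delta paths are placeholders, not my real setup, and I have left out value parsing/schema handling:

```python
# Hypothetical topic -> Delta path routing; names and paths are placeholders.
TOPIC_TO_TABLE = {
    "topic_a": "/delta/table_a",
    "topic_b": "/delta/table_b",
}

def table_for(topic: str) -> str:
    """Return the Delta table path for a given Kafka topic."""
    return TOPIC_TO_TABLE[topic]

def main():
    # pyspark is imported lazily so the routing helper above stays
    # importable without a Spark installation.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    source = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", ",".join(TOPIC_TO_TABLE))
        .load()
    )
    # One writeStream per filtered DataFrame: note that each of these
    # queries plans its own scan of the Kafka source.
    for topic in TOPIC_TO_TABLE:
        (
            source.filter(col("topic") == topic)
            .writeStream.format("delta")
            .option("checkpointLocation", f"/checkpoints/{topic}")
            .start(table_for(topic))
        )
    spark.streams.awaitAnyTermination()

if __name__ == "__main__":
    main()
```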
I saw the following Stack Overflow threads, which all advocate for that approach:
- Can I "branch" stream into many and write them in parallel in pyspark?
- What is the best way of reading multiple kafka topics in spark streaming?
- Spark Structured Streaming reading from multiple Kafka topics with multiple read streams
However, there is a warning here that suggests pushing the filtering down into foreachBatch because of the lineage, or more specifically, because the source would be read multiple times if the filter is not pushed down into foreachBatch.
The same advice is repeated here.
B) 1 Streaming query per topic
Everything is independent: one query per topic. This feels less efficient, though.
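For completeness, this is the shape I have in mind for B), again with placeholder topic names, broker address, and paths:

```python
# Hypothetical topic -> Delta path mapping and broker address (placeholders).
TOPIC_TO_TABLE = {
    "topic_a": "/delta/table_a",
    "topic_b": "/delta/table_b",
}

def kafka_options(topic: str, servers: str = "broker:9092") -> dict:
    """Reader options for a single-topic subscription."""
    return {"kafka.bootstrap.servers": servers, "subscribe": topic}

def main():
    # pyspark imported lazily so kafka_options stays importable without Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # One fully independent query per topic: separate Kafka read, separate
    # checkpoint, separate failure domain.
    for topic, path in TOPIC_TO_TABLE.items():
        (
            spark.readStream.format("kafka")
            .options(**kafka_options(topic))
            .load()
            .writeStream.format("delta")
            .option("checkpointLocation", f"/checkpoints/{topic}")
            .start(path)
        )
    spark.streams.awaitAnyTermination()

if __name__ == "__main__":
    main()
```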
Questions:
A-1) What happens if some topics receive data infrequently while others have a lot of incoming data? Can that affect the overall processing?
A-2) Is the order of the messages maintained in that scenario? Order matters in my situation because the messages represent entity updates (each message is a new version of the entity, i.e. not a delta change of the entity).
A-B) What are the implications of each approach with respect to performance and concurrency, and is there a third option for dealing with this scenario? The outcomes seem very similar to me, but I don't know the internals of the Kafka source well enough to make that call.