I have multiple Kafka topics that I need to sink into their respective Delta tables.
A) 1 Streaming query for all topics
If I use one streaming query, the resulting DataFrame contains data from multiple topics. I could filter the DataFrame after the read from Kafka, i.e. create one DataFrame per topic, and then write each DataFrame separately to its corresponding table.
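Here is roughly what I mean, as a sketch. The topic names, broker address, and Delta paths are placeholders, not my real setup, and I have left out value parsing/schema handling:

```python
# Hypothetical topic -> Delta path routing; names and paths are placeholders.
TOPIC_TO_TABLE = {
    "topic_a": "/delta/table_a",
    "topic_b": "/delta/table_b",
}

def table_for(topic: str) -> str:
    """Return the Delta table path for a given Kafka topic."""
    return TOPIC_TO_TABLE[topic]

def main():
    # pyspark is imported lazily so the routing helper above stays
    # importable without a Spark installation.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    source = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", ",".join(TOPIC_TO_TABLE))
        .load()
    )
    # One writeStream per filtered DataFrame: note that each of these
    # queries plans its own scan of the Kafka source.
    for topic in TOPIC_TO_TABLE:
        (
            source.filter(col("topic") == topic)
            .writeStream.format("delta")
            .option("checkpointLocation", f"/checkpoints/{topic}")
            .start(table_for(topic))
        )
    spark.streams.awaitAnyTermination()

if __name__ == "__main__":
    main()
```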
I saw the following Stack Overflow threads, which all advocate for that approach:
- Can I "branch" stream into many and write them in parallel in pyspark?
- What is the best way of reading multiple kafka topics in spark streaming?
- Spark Structured Streaming reading from multiple Kafka topics with multiple read streams
However, there is a warning here that suggests pushing the filtering down into foreachBatch because of the lineage, or more specifically, because the source would be read multiple times if the filter is not pushed down into foreachBatch.
The same advice is repeated here.
B) 1 Streaming query per topic
Everything is independent: one query per topic. This feels less efficient, though.
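For completeness, this is the shape I have in mind for B), again with placeholder topic names, broker address, and paths:

```python
# Hypothetical topic -> Delta path mapping and broker address (placeholders).
TOPIC_TO_TABLE = {
    "topic_a": "/delta/table_a",
    "topic_b": "/delta/table_b",
}

def kafka_options(topic: str, servers: str = "broker:9092") -> dict:
    """Reader options for a single-topic subscription."""
    return {"kafka.bootstrap.servers": servers, "subscribe": topic}

def main():
    # pyspark imported lazily so kafka_options stays importable without Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # One fully independent query per topic: separate Kafka read, separate
    # checkpoint, separate failure domain.
    for topic, path in TOPIC_TO_TABLE.items():
        (
            spark.readStream.format("kafka")
            .options(**kafka_options(topic))
            .load()
            .writeStream.format("delta")
            .option("checkpointLocation", f"/checkpoints/{topic}")
            .start(path)
        )
    spark.streams.awaitAnyTermination()

if __name__ == "__main__":
    main()
```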
Questions:
A-1) What happens if some topics receive data infrequently while others have a lot of incoming data? Can that affect the overall processing?
A-2) Is the order of the messages maintained in that scenario? Order matters in my situation because the messages represent entity updates (each message is a new version of the entity, i.e. not a delta change of the entity).
A-B) What are the implications of each approach with respect to performance and concurrency, and is there a third option for dealing with this scenario? The outcomes seem very similar to me, but I don't know the internals of the Kafka source well enough to make that call.