Spark Structured Streaming solves this exact problem.
Add the three dependencies below to your pom.xml:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.4</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_2.11</artifactId>
<version>2.4.4</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.4</version>
</dependency>
Below is sample code that replicates the scenario:
import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.OutputMode;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.slf4j.LoggerFactory;
import java.io.IOException;
import java.util.Objects;
/**
 * Demonstrates that two Structured Streaming queries on a single SparkSession
 * run independently: a long-running (blocking) map in one query does not stall
 * the other. Reads from two Kafka topics ({@code userDetails}, {@code userName})
 * and writes results to two output topics ({@code temp}, {@code temp2}).
 *
 * <p>Requires a Kafka broker on localhost:9092 with the input topics created.
 */
public class StructuredSparkStreamingExample {
    /** Spark config key controlling where streaming checkpoint data is persisted. */
    private static final String SPARK_SQL_STREAMING_CHECKPOINT_LOCATION_CONFIG = "spark.sql.streaming.checkpointLocation";
    /** Base checkpoint directory; Spark creates a per-query subdirectory under it. */
    private static final String SPARK_SQL_STREAMING_CHECKPOINT_LOCATION = "/tmp/checkpoints";

    public static void main(String[] args) throws InterruptedException {
        try {
            new StructuredSparkStreamingExample().initSparkSession();
        } catch (InterruptedException e) {
            // Restore the interrupt status so callers/JVM shutdown hooks can
            // still observe the interruption — swallowing it loses the signal.
            Thread.currentThread().interrupt();
            e.printStackTrace();
        } catch (StreamingQueryException | IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Builds a local SparkSession and starts two independent Kafka-to-Kafka
     * streaming queries, then blocks on the first query's termination.
     *
     * <p>The first query ({@code userDetails} → {@code temp}) sleeps 10 seconds
     * per micro-batch to mimic a slow remote call; the second query
     * ({@code userName} → {@code temp2}) keeps flowing regardless, which is the
     * point of the demonstration.
     *
     * @throws IOException             if checkpoint I/O fails
     * @throws InterruptedException    if the awaiting thread is interrupted
     * @throws StreamingQueryException if a streaming query terminates with an error
     */
    public void initSparkSession() throws IOException, InterruptedException, StreamingQueryException {
        SparkSession spark = SparkSession.builder().master("local[*]").appName("StructuredSparkStreamingExample")
                .config(SPARK_SQL_STREAMING_CHECKPOINT_LOCATION_CONFIG, SPARK_SQL_STREAMING_CHECKPOINT_LOCATION)
                .getOrCreate();
        // Silence Spark's chatty INFO/WARN output so the println() progress is visible.
        Logger root = (Logger) LoggerFactory.getLogger(org.slf4j.Logger.ROOT_LOGGER_NAME);
        root.setLevel(Level.ERROR);

        Dataset<Row> datasetsUserDetails = spark
                .readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "userDetails")
                // Cap each micro-batch at 2 records so the 10s sleep per batch is observable.
                .option("maxOffsetsPerTrigger", 2)
                // NOTE(review): "maxTriggerDelay" is a Spark 3.4+ option (availableNow
                // triggers); Spark 2.4's Kafka source ignores it — confirm or remove.
                .option("maxTriggerDelay", "5s")
                .option("startingOffsets", "latest").load();
        Dataset<Row> datasetUserName = spark
                .readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "userName")
                .option("maxOffsetsPerTrigger", 2)
                // NOTE(review): see the comment on the same option above.
                .option("maxTriggerDelay", "5s")
                .option("startingOffsets", "latest").load();

        // Query 1: deliberately slow — sleeps 10s per batch to mimic a remote call.
        // (The original cast the value column twice; a single CAST is sufficient,
        // since CAST preserves the source column name "value".)
        StreamingQuery sqUserDetails = datasetsUserDetails.selectExpr("CAST(value AS STRING)")
                .filter(Objects::nonNull)
                .map((MapFunction<Row, String>) row -> {
                    System.out.println("sleeping");
                    Thread.sleep(10000);
                    System.out.println("waking up");
                    return "100";
                }, Encoders.STRING())
                .writeStream()
                .outputMode(OutputMode.Update())
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("kafka.max.request.size", "200000000")
                .option("topic", "temp")
                // .trigger(Trigger.Continuous(1000)) //only works when deployed on spark cluster. checkout https://stackoverflow.com/a/69469773
                .start();

        // Query 2: a plain pass-through; runs on its own scheduler thread and is
        // therefore unaffected by query 1's sleeping map function.
        datasetUserName.selectExpr("CAST(value AS STRING)")
                .filter(Objects::nonNull)
                .writeStream()
                .outputMode(OutputMode.Update())
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("kafka.max.request.size", "200000000")
                .option("topic", "temp2")
                // .trigger(Trigger.Continuous(1000)) //only works when deployed on spark cluster. checkout https://stackoverflow.com/a/69469773
                .start();

        // Block the main thread on the slow query; query 2 keeps running meanwhile.
        sqUserDetails.awaitTermination();
    }
}
Even though sqUserDetails is blocked for 10 seconds (mimicking a remote call), that does not block datasetUserName from proceeding.