
I have the following scenario:

  • 4 wearable sensors attached to each individual.
  • A potentially unbounded number of individuals.
  • A Kafka cluster.

I have to perform real-time processing on these data streams on a cluster running an instance of Apache Flink. Kafka is the data hub between the Flink cluster and the sensors. Moreover, each subject's streams are completely independent, and different streams belonging to the same subject are also independent of each other.

Here is the setup I have in mind: I create a dedicated topic for each subject, and each topic is split into 4 partitions, one for each sensor on that specific person. With this layout I thought I would establish a consumer group for every topic.

Actually, my data volume is not that large, but my interest is in building an easily scalable system. One day I might have hundreds of individuals, for instance...

My questions are:

  • Is this setup good? What do you think about it?
  • With this setup I will have 4 Kafka brokers and each one handles one partition, right (not counting potential replicas)?

Destroy me, guys, and thanks in advance.

1 Answer


You can't have an infinite number of topics in a Kafka cluster, so if you plan to scale beyond roughly 10,000 topics, you should consider another design. Instead of giving each individual a dedicated topic, use the individual's ID as the key and publish data as key/value pairs to a smaller number of topics. In Kafka you can have an (almost) infinite number of keys.
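For illustration, a minimal Java producer sketch along these lines (the broker address, the topic name sensor-readings, and the JSON payload are assumptions for the example, not anything from the question):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class SensorProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // assumed broker address
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                String subjectId = "subject-42";                     // the individual's ID is the record key
                String reading = "{\"sensor\":\"hr\",\"value\":72}"; // hypothetical payload
                // Records with the same key always land in the same partition,
                // so one subject's data stays together without a per-subject topic.
                producer.send(new ProducerRecord<>("sensor-readings", subjectId, reading));
            }
        }
    }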

Also consider more partitions. Each of your 4 brokers can handle many partitions. If a topic has only 4 partitions, then at most 4 consumers can work in parallel in a consumer group (in your case, in Flink).
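As a sketch, you could create such a topic programmatically with Kafka's AdminClient; the partition count and replication factor below are placeholders, not recommendations:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateSensorTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // assumed broker address
            try (AdminClient admin = AdminClient.create(props)) {
                // 32 partitions allow up to 32 consumers in one group to read in parallel;
                // a replication factor of 3 fits on a 4-broker cluster.
                NewTopic topic = new NewTopic("sensor-readings", 32, (short) 3);
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }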


3 Comments

In your opinion, is it better to distinguish subjects by key directly in Flink, and add many partitions without any particular criteria just to improve throughput, right? I imagined 4 partitions because I need to separate the streams. I want each stream to be evaluated by the same consumer, and each consumer to always evaluate the same stream, because I have to perform real-time anomaly detection on running streams, so I can't allow a consumer to receive data from other streams.
OK, I've investigated further in the docs and on Stack Overflow. If I understood correctly, having unlimited topics is bad, but I can have lots of partitions. Maybe I can create just 4 topics, one for each sensor type, and then add a partition (in each topic) for each subject. Streams belonging to the same user would then be sent to a specific partition using a custom partitioner. As for consumers, maybe I can assign specific partitions to consumers? But then I would need to add 4 consumers every time a new subject joins the system (so 4 more nodes in the Flink cluster). Is this reasonable? I don't know if consumer groups can help me here.
The limits on topics also apply to partitions, so you can't have unlimited partitions either. I would make 4 topics (one for each type of sensor), publish all messages with the individual/subject ID as the key in a key/value message, and then have as many partitions as you need to enable parallel consumption (100 partitions if you have 100 nodes running Flink). Flink will be able to aggregate and compute stats by key, and in Kafka all data published with the same key is guaranteed to be stored in the same partition, so all the data for any individual will always be in one partition.
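To make the last comment concrete, here is a minimal Flink sketch, assuming a topic named sensor-readings and a payload whose first comma-separated field is the subject ID; extractSubjectId and the map placeholder are hypothetical stand-ins for real parsing and anomaly-detection logic:

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.kafka.source.KafkaSource;
    import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class AnomalyDetectionJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            KafkaSource<String> source = KafkaSource.<String>builder()
                    .setBootstrapServers("broker1:9092")   // assumed broker address
                    .setTopics("sensor-readings")          // assumed topic (one per sensor type)
                    .setGroupId("anomaly-detector")
                    .setStartingOffsets(OffsetsInitializer.latest())
                    .setValueOnlyDeserializer(new SimpleStringSchema())
                    .build();

            DataStream<String> readings =
                    env.fromSource(source, WatermarkStrategy.noWatermarks(), "sensors");

            // keyBy partitions the stream by subject ID, so every record for a
            // given subject is handled by the same parallel task instance.
            readings
                    .keyBy(value -> extractSubjectId(value))
                    .map(value -> value) // placeholder for the per-subject anomaly-detection logic
                    .print();

            env.execute("per-subject anomaly detection");
        }

        // Hypothetical helper: pull the subject ID out of the record payload.
        private static String extractSubjectId(String record) {
            return record.split(",")[0];
        }
    }

Because Kafka keeps all records with the same key in one partition, and Flink's keyBy routes all records with the same key to one task, the per-subject state needed for anomaly detection stays local to a single task.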
