
I have the following scenario:

  • 4 wearable sensors attached to each individual.
  • A potentially unbounded number of individuals.
  • A Kafka cluster.

I have to perform real-time processing on these data streams on a cluster running an instance of Apache Flink. Kafka is the data hub between the Flink cluster and the sensors. Moreover, each subject's streams are completely independent, and different streams belonging to the same subject are also independent of each other.

Here is the setup I have in mind: I create a dedicated topic for each subject, and each topic is split into 4 partitions, one for each sensor on that specific person. With this layout I thought I would establish a consumer group for every topic.

Actually, my data volume is not that large, but my interest is in building an easily scalable system. One day I might have hundreds of individuals, for instance...

My questions are:

  • Is this setup good? What do you think about it?
  • With this setup I will have 4 Kafka brokers and each one handles one partition, right (not counting potential replicas)?

Destroy me, guys, and thanks in advance.

1 Answer


You can't have an infinite number of topics in a Kafka cluster, so if you plan to scale beyond roughly 10,000 topics, you should consider another design. Instead of giving each individual a dedicated topic, use the individual's ID as the key and publish data as key/value pairs to a smaller number of topics. In Kafka you can have an (almost) infinite number of keys.
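For illustration, a minimal Java producer sketch along these lines (the broker address, the topic name sensor-readings, and the JSON payload are assumptions for the example, not anything from the question):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class SensorProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // assumed broker address
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                String subjectId = "subject-42";                     // the individual's ID is the record key
                String reading = "{\"sensor\":\"hr\",\"value\":72}"; // hypothetical payload
                // Records with the same key always land in the same partition,
                // so one subject's data stays together without a per-subject topic.
                producer.send(new ProducerRecord<>("sensor-readings", subjectId, reading));
            }
        }
    }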

Also consider more partitions. Each of your 4 brokers can handle many partitions. If a topic has only 4 partitions, then at most 4 consumers can work in parallel in a consumer group (in your case, in Flink).
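As a sketch, you could create such a topic programmatically with Kafka's AdminClient; the partition count and replication factor below are placeholders, not recommendations:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateSensorTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // assumed broker address
            try (AdminClient admin = AdminClient.create(props)) {
                // 32 partitions allow up to 32 consumers in one group to read in parallel;
                // a replication factor of 3 fits on a 4-broker cluster.
                NewTopic topic = new NewTopic("sensor-readings", 32, (short) 3);
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }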


3 Comments

In your opinion, is it better to distinguish subjects by key directly in Flink, and add many partitions without any particular criteria just to improve throughput, right? I imagined 4 partitions because I need to separate the streams. I want each stream to be evaluated by the same consumer, and each consumer to always evaluate the same stream, because I have to perform real-time anomaly detection on running streams, so I can't allow a consumer to receive data from other streams.
OK, I've investigated further in the docs and on Stack Overflow. If I understood correctly, having unlimited topics is bad, but I can have lots of partitions. Maybe I can create just 4 topics, one for each sensor type, and then add a partition (in each topic) for each subject. Streams belonging to the same user would then be sent to a specific partition using a custom partitioner. As for consumers, maybe I can assign specific partitions to consumers? But then I would need to add 4 consumers every time a new subject joins the system (so 4 more nodes in the Flink cluster). Is this reasonable? I don't know if consumer groups can help me here.
The limits on topics also apply to partitions, so you can't have unlimited partitions either. I would make 4 topics (one for each type of sensor), publish all messages with the individual/subject ID as the key in a key/value message, and then have as many partitions as you need to enable parallel consumption (100 partitions if you have 100 nodes running Flink). Flink will be able to aggregate and compute stats by key, and in Kafka all data published with the same key is guaranteed to be stored in the same partition, so all the data for any individual will always be in one partition.
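To make the last comment concrete, here is a minimal Flink sketch, assuming a topic named sensor-readings and a payload whose first comma-separated field is the subject ID; extractSubjectId and the map placeholder are hypothetical stand-ins for real parsing and anomaly-detection logic:

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.kafka.source.KafkaSource;
    import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class AnomalyDetectionJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            KafkaSource<String> source = KafkaSource.<String>builder()
                    .setBootstrapServers("broker1:9092")   // assumed broker address
                    .setTopics("sensor-readings")          // assumed topic (one per sensor type)
                    .setGroupId("anomaly-detector")
                    .setStartingOffsets(OffsetsInitializer.latest())
                    .setValueOnlyDeserializer(new SimpleStringSchema())
                    .build();

            DataStream<String> readings =
                    env.fromSource(source, WatermarkStrategy.noWatermarks(), "sensors");

            // keyBy partitions the stream by subject ID, so every record for a
            // given subject is handled by the same parallel task instance.
            readings
                    .keyBy(value -> extractSubjectId(value))
                    .map(value -> value) // placeholder for the per-subject anomaly-detection logic
                    .print();

            env.execute("per-subject anomaly detection");
        }

        // Hypothetical helper: pull the subject ID out of the record payload.
        private static String extractSubjectId(String record) {
            return record.split(",")[0];
        }
    }

Because Kafka keeps all records with the same key in one partition, and Flink's keyBy routes all records with the same key to one task, the per-subject state needed for anomaly detection stays local to a single task.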
