We are running a few Flink jobs, all of which have a Kafka source and multiple Cassandra sinks. We rely heavily on time windows with a reduce function over keyed data.
Our throughput is currently around 100 to 200 TPS.
I have a few questions about checkpoints and the size of the state being saved:
Since we're using a reduce function, is the state size influenced only by the number of open windows? If an hourly window and a minute window both use the same accumulator, should we expect a similar state size? For some reason we're seeing that the hourly window has a much larger state than the minute window, and the daily window has a larger state than the hourly window.
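To make this question concrete, here is the rough model we have in mind. It is plain Java arithmetic, not Flink code, and it rests on our own assumption that a reduce function keeps one accumulator per (key, open window) pair. The class name, method name, and every number in it are made up for illustration. If the assumption holds, a longer window that sees more distinct keys before it closes would hold more entries, which might be what we are observing.

```java
// Back-of-envelope model only; this does not call any Flink API.
// Assumption (ours, unverified): one accumulator is stored per (key, open window) pair.
public class WindowStateEstimate {

    /** Estimated bytes for one open window: distinct keys seen so far times bytes per entry. */
    static long estimateBytes(long distinctKeysInWindow, long bytesPerEntry) {
        return distinctKeysInWindow * bytesPerEntry;
    }

    public static void main(String[] args) {
        long bytesPerEntry = 200;      // hypothetical: key + window metadata + accumulator
        long keysPerMinute = 5_000;    // hypothetical distinct keys seen within one minute
        long keysPerHour = 150_000;    // hypothetical distinct keys seen within one hour

        System.out.println("minute window: " + estimateBytes(keysPerMinute, bytesPerEntry) + " bytes");
        System.out.println("hourly window: " + estimateBytes(keysPerHour, bytesPerEntry) + " bytes");
    }
}
```

Under this model the accumulator size is identical for both windows, yet the hourly window is larger simply because more distinct keys arrive during an hour than during a minute. Is that the right way to think about it?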
What is considered a reasonable number of open windows? What is considered a large state? What are the most common checkpoint intervals (ours is 5 seconds, which seems far too frequent to me)? How long should we expect a checkpoint to take on reasonable storage for 1 GB of state? How can terabytes of state (which I have read some systems have) be checkpointed in a reasonable amount of time? I know these are abstract questions, but we're not sure whether our Flink setup is working as expected, or what to expect as our data grows.
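As a sanity check on the 5 second interval, this is the naive arithmetic we did. It assumes the full state is written on every checkpoint and that storage sustains a fixed write throughput. The 100 MB/s figure is a guess rather than a measurement, and the class and method names are ours.

```java
// Naive estimate only; assumes the whole state is written on each checkpoint
// at a constant throughput. Neither assumption has been verified on our setup.
public class CheckpointEstimate {

    /** Seconds needed to write stateBytes at the given sustained throughput. */
    static double secondsToWrite(double stateBytes, double throughputBytesPerSec) {
        if (throughputBytesPerSec <= 0) {
            throw new IllegalArgumentException("throughput must be positive");
        }
        return stateBytes / throughputBytesPerSec;
    }

    public static void main(String[] args) {
        double oneGb = 1e9;          // 1 GB of state, as in the question above
        double throughput = 100e6;   // hypothetical: 100 MB/s sustained write speed
        double intervalSec = 5.0;    // our current checkpoint interval

        double seconds = secondsToWrite(oneGb, throughput);
        System.out.println("estimated write time: " + seconds + " s");
        System.out.println("exceeds interval: " + (seconds > intervalSec));
    }
}
```

If this estimate is even roughly right, a single checkpoint of 1 GB would take longer than our interval, which is part of why the 5 second setting worries us.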
We're seeing both async and sync checkpoint times in the UI. Can anyone explain why Flink uses both?
Thanks to anyone who can help with any of these questions.