In Flink, why is aggregation not supported by DataStream

Question

I am new to Flink. Sometimes I want to aggregate a DataStream without having to do a keyBy first. Why doesn't Flink support aggregations (sum, min, max, etc.) directly on a DataStream?

Thank you, Ahmed.

If you find this answer useful, you can comment on it, upvote it, or accept it. Otherwise the question may not be useful to future viewers.
– Jaya Ananthram, Commented Mar 19, 2021 at 10:36

Jaya Ananthram · Accepted Answer · 2021-03-16 07:52:50Z

2

Flink does support aggregation on a non-keyed stream, but you have to apply the windowAll operation first and then the aggregation. windowAll reduces the parallelism to 1, meaning all the data flows through a single task slot. This is by design: with more than one task slot, each slot can aggregate only the data that reaches it, not the data held in other slots.

If windowAll with a parallelism of 1 doesn't fit your use case (for example, when the source produces a large number of records), you can apply keyBy followed by an aggregation, which yields one aggregated result per key, and then apply windowAll and a final aggregation. This way the per-key aggregation runs in parallel across different task slots, and only the reduced data is aggregated in a single task slot.

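The following is an example of windowAll without a keyBy operation:

// Event-time windows require timestamps and watermarks to be assigned upstream.
environment.fromCollection(list)
        .windowAll(TumblingEventTimeWindows.of(Time.seconds(5)))
        .max(1)  // maximum of the tuple field at position 1, computed with parallelism 1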

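The following is an example of windowAll after a keyBy operation:

environment.fromCollection(list)
        .keyBy(1)  // partition by the tuple field at position 1
        .window(TumblingEventTimeWindows.of(Time.seconds(5)))
        .maxBy(1)  // per-key result, computed in parallel
        .windowAll(TumblingEventTimeWindows.of(Time.seconds(5)))
        .max(1)  // final aggregation over the per-key results, parallelism 1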
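
The two-stage idea described above can also be sketched outside Flink. The following is a minimal plain-Java simulation, not Flink code: the class name TwoStageMax and the (key, value) record layout are assumptions made for illustration, and windowing is left out entirely. It only shows that reducing per-key partial maxima gives the same answer as a single global pass while handing much less data to the final single task.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; nothing here is part of the Flink API.
public class TwoStageMax {

    // Stage 1: one partial maximum per key, as the parallel keyed tasks would compute.
    public static Map<Integer, Integer> maxPerKey(List<int[]> records) {
        Map<Integer, Integer> partial = new HashMap<>();
        for (int[] record : records) {
            // record[0] is the key, record[1] is the value being aggregated
            partial.merge(record[0], record[1], Math::max);
        }
        return partial;
    }

    // Stage 2: a single task reduces the (much smaller) set of partial results.
    public static int globalMax(Map<Integer, Integer> partial) {
        int max = Integer.MIN_VALUE;
        for (int value : partial.values()) {
            max = Math.max(max, value);
        }
        return max;
    }

    public static void main(String[] args) {
        List<int[]> records = new ArrayList<>();
        records.add(new int[] {1, 4});
        records.add(new int[] {2, 9});
        records.add(new int[] {1, 7});
        records.add(new int[] {2, 3});

        Map<Integer, Integer> partial = maxPerKey(records);
        System.out.println(partial.get(1));      // 7
        System.out.println(partial.get(2));      // 9
        System.out.println(globalMax(partial));  // 9
    }
}
```

Here four input records shrink to two partial results before the final step. The benefit in a real job would depend on how many distinct keys there are relative to the number of records.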

Reference for the documentation - here

answered Mar 16, 2021 at 7:52

Jaya Ananthram


1 Comment

Json Over a year ago

I think the question poster may want to know whether there is any way to pre-aggregate the events before the keyBy, to avoid the shuffling that keyBy triggers and so improve performance. This scenario is quite common but is not well supported in Flink. There is no out-of-the-box solution, though you can write your own based on the low-level APIs.
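
As a rough illustration of the kind of pre-aggregation this comment describes, here is a minimal plain-Java sketch. It is not Flink code and not an existing Flink operator: LocalPreAggregator, maxBufferedKeys, and the in-memory "downstream" map are all hypothetical names that stand in for a custom operator and the keyed aggregation behind the shuffle. The idea is to sum values per key in a local buffer and forward only the partial sums once the buffer fills up.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Flink.
public class LocalPreAggregator {
    private final int maxBufferedKeys;
    // Partial sums held locally, before any shuffle would happen.
    private final Map<String, Long> buffer = new HashMap<>();
    // Stands in for the keyed aggregation that runs after the shuffle.
    private final Map<String, Long> downstream = new HashMap<>();
    private int recordsSent = 0;

    public LocalPreAggregator(int maxBufferedKeys) {
        this.maxBufferedKeys = maxBufferedKeys;
    }

    // Combine the value into the local partial sum; flush once the buffer is full.
    public void add(String key, long value) {
        buffer.merge(key, value, Long::sum);
        if (buffer.size() >= maxBufferedKeys) {
            flush();
        }
    }

    // Forward one partial sum per buffered key, then clear the buffer.
    public void flush() {
        for (Map.Entry<String, Long> entry : buffer.entrySet()) {
            downstream.merge(entry.getKey(), entry.getValue(), Long::sum);
            recordsSent++;
        }
        buffer.clear();
    }

    public int recordsSent() {
        return recordsSent;
    }

    public Map<String, Long> downstreamSums() {
        return downstream;
    }

    public static void main(String[] args) {
        LocalPreAggregator agg = new LocalPreAggregator(2);
        agg.add("a", 1);
        agg.add("a", 2);
        agg.add("a", 3);
        agg.add("b", 4); // buffer now holds 2 keys, so a=6 and b=4 are forwarded
        agg.add("a", 5);
        agg.flush();     // forwards the remaining a=5
        System.out.println(agg.recordsSent());     // 3
        System.out.println(agg.downstreamSums());  // sums: a=11, b=4
    }
}
```

In this run five input records become three forwarded records. In an actual job the buffer would presumably also need to be flushed on a timer and before checkpoints so that buffered data is neither delayed indefinitely nor lost, which is part of why this has to be hand-written.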

David Anderson · Accepted Answer · 2021-03-16 09:13:47Z

1

With FLIP-134 the Flink community has decided to deprecate all of these relational methods from the DataStream API:

DataStream#project
Windowed/KeyedStream#sum,min,max,minBy,maxBy
DataStream#keyBy where the key is specified by field name or index (including ConnectedStreams#keyBy)

The rationale behind this decision is that Table/SQL is a more complete and more performant relational API, and it already supports both batch and streaming. With these APIs you can easily perform global aggregations, without having to first do a keyBy or GROUP BY.

An example:

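import static org.apache.flink.table.api.Expressions.$;

import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

SingleOutputStreamOperator<Integer> numbers = env.fromElements(0, 1, 1, 0, 3, 2);

// Expose the stream as a table with a single column named "n".
Table data = tableEnv.fromDataStream(numbers, $("n"));

// Global maximum over the whole stream; no keyBy or GROUP BY is needed.
Table results = data.select($("n").max());

// The result changes as data arrives, so it is emitted as a retract stream.
tableEnv
        .toRetractStream(results, Row.class)
        .print();

env.execute();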
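
To make the retract stream in the example above easier to read, here is a small plain-Java simulation of its update semantics. It is not Flink code, and RetractMaxSimulation is a hypothetical name: it assumes only the documented convention that a true flag marks an added result and a false flag marks a retraction of the previous one. The exact formatting Flink prints may differ between versions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of retract-stream updates for a running MAX; not Flink code.
public class RetractMaxSimulation {

    // Returns "(flag,value)" messages: true adds a result, false retracts the old one.
    public static List<String> runningMax(int... input) {
        List<String> messages = new ArrayList<>();
        Integer currentMax = null;
        for (int n : input) {
            if (currentMax == null) {
                // First element: there is nothing to retract yet.
                currentMax = n;
                messages.add("(true," + n + ")");
            } else if (n > currentMax) {
                // The aggregate changed: retract the old maximum, then add the new one.
                messages.add("(false," + currentMax + ")");
                messages.add("(true," + n + ")");
                currentMax = n;
            }
            // Elements that do not change the maximum produce no message.
        }
        return messages;
    }

    public static void main(String[] args) {
        System.out.println(runningMax(0, 1, 1, 0, 3, 2));
        // [(true,0), (false,0), (true,1), (false,1), (true,3)]
    }
}
```

The last message with a true flag carries the current global maximum, here 3.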

edited Mar 16, 2021 at 9:13

answered Mar 16, 2021 at 8:51

David Anderson


2 Comments

Jaya Ananthram Over a year ago

How does this work internally in terms of task slot utilisation? Is only one slot used throughout, or more than one, with a final reduction to one?

David Anderson Over a year ago

I don't believe the optimizer is smart enough to do a parallel pre-aggregation first. You can examine the execution plan and check, but if you want the optimized version I suspect you'll have to do it yourself.
