Java-Stream - Sort and Transform elements using groupingBy()

Question

Suppose I have the following domain class:

public class OrderRow {
    private Long orderId;
    private String action;
    private LocalDateTime timestamp;
    
    // getters, constructor, etc.
}

I have a following data set of OrderRows :

OrderId     Action                  Timestamp
3           Pay money               2015-05-27 12:48:47.000
3           Select Item             2015-05-27 12:44:47.000
1           Generate Payment        2015-05-27 12:55:47.000
2           Pay money               2015-05-27 12:48:47.000
2           Select Item             2015-05-27 12:44:47.000
2           Deliver                 2015-05-27 12:55:47.000
1           Generate Invoice        2015-05-27 12:48:47.000
1           Create PO               2015-05-27 12:44:47.000
3           Deliver                 2015-05-27 12:55:47.000

What I want to obtain the following Map from the sample data shown above:

[3] -> ["Select Item", "Pay money", "Deliver"]
[1] -> ["Create PO", "Generate Invoice", "Generate Payment"]
[2] -> ["Select Item", "Pay money", "Deliver"]

By performing below operations :

I want to groupBy orderId.
Sort actions by timestamp.
Create a Set (as there can be duplicates) of actions.

I am trying to do this in a single groupingBy operation as performing separate sorting, mapping operations take a lot of time if data set is huge.

I've tried to do the following:

orderRows.stream()
    .collect(Collectors.groupingBy(OrderRow::getOrderId, 
        Collectors.mapping(Function.identity(),
            Collectors.toCollection(
                () -> new TreeSet<>(Comparator.comparing(e -> e.timestamp))
    ))));

But then I get output as Map<String, Set<OrderRow>> where as I need the result of type Map<String, Set<String>>.

Would be really grateful if someone can show me at least a direction to go.

Note that is a critical operation and should be done in few milliseconds, hence performance is important.

Alexander Ivanchenko · Accepted Answer · 2022-12-08 20:10:56Z

1

TL;DR

LinkedHashSet - is the only implementation of the Set interface, which would to retain the order of the plain Strings (action property), that differs from the alphabetical one.
Timsort has a slightly better performance than sorting via Red-black tree (i.e. by storing elments into a TreeSet). Since OP has said that "is a critical operation", that worth to take into consideration.

You can sort the stream elements by timestamp before applying collect. That would guarantee that actions in the list would be consumed in the required order.

And to retain the order of these Strings, you can use a LinkedHashSet (TreeSet would not be helpful in this case since you need to keep the order by property which would not be present in the collection).

var actionsById = orderRows.stream()
    .sorted(Comparator.comparing(OrderRow::getTimestamp))
    .collect(Collectors.groupingBy(
        OrderRow::getOrderId,
        Collectors.mapping(
            OrderRow::getAction,
            Collectors.toCollection(LinkedHashSet::new))
    ));

Alternatively, you can sort the sets while grouping (as you were trying to do).

var actionsById = orderRows.stream()
    .collect(Collectors.groupingBy(
        OrderRow::getOrderId,
        Collectors.collectingAndThen(
            Collectors.mapping(
                Function.identity(),
                Collectors.toCollection(() -> new TreeSet<>(Comparator.comparing(OrderRow::getTimestamp)))
            ),
            set -> set.stream().map(OrderRow::getAction).collect(Collectors.toCollection(LinkedHashSet::new))
        )
    ));

Or instead of sorting via Red-Black Tree (TreeSet is backed by the an implementation of this data structure) each set can be sorted via Timsort (which is an algorithm used while sorting an array of elements, and sorted() operation is invoked in the stream, its elements are being dumped into an array).

Maintaining a Red-Black Tree is constful, and theoretically Timsort should perform a bit better.

var actionsById = orderRows.stream()
    .collect(Collectors.groupingBy(
        OrderRow::getOrderId,
        Collectors.collectingAndThen(
            Collectors.mapping(
                Function.identity(), Collectors.toList()
            ),
            list -> list.stream().sorted(Comparator.comparing(OrderRow::getTimestamp))
                .map(OrderRow::getAction)
                .collect(Collectors.toCollection(LinkedHashSet::new))
        )
    ));

To eliminate allocating a new array in memory (which happens when sorted() operation is being called) we can sort the lists produced by mapping() directly, and then create a stream out of sorted lists in the finisher function of collectingAndThen().

That would require writing a multiline lambda, but in this case is justifiable by the performance gain.

var actionsById = orderRows.stream()
    .collect(Collectors.groupingBy(
        OrderRow::getOrderId,
        Collectors.collectingAndThen(
            Collectors.mapping(
                Function.identity(), Collectors.toList()
            ),
            list -> {
                list.sort(Comparator.comparing(OrderRow::getTimestamp));
                return list.stream().map(OrderRow::getAction)
                    .collect(Collectors.toCollection(LinkedHashSet::new));
            }
        )
    ));

edited Dec 8, 2022 at 20:10

answered Dec 8, 2022 at 18:43

Alexander Ivanchenko

29.3k6 gold badges29 silver badges54 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

CovetousSlope Over a year ago

I did this exactly, but the time required for doing individual sorting before groupingBy is almost 6 times. This is a critical operation and should be done in few milliseconds. That is why I want to club everything together.

CovetousSlope Over a year ago

yes this is solving my problem. Only one thing here is , user of sorted() is adding some performance hit as compared to collectingAndThen( mapping(Function.identity(), toCollection( () -> new TreeSet<>(Comparator.comparing(e -> e.timestamp)) )), set -> set.stream().map(OrderRow::getEventName) .collect(toCollection(LinkedHashSet::new)) )))

Alexander Ivanchenko Over a year ago

@CovetousSlope There's only one reason I can see why Red-black tree perform better, namely sorted() requires allocating a new array in memory, which has a cost. And it that can be eliminated by sorting an ArrayList directly and then creating a stream out of it (I'll update the code in a few minutes). That said, to measure the performance correctly, you need to make sure that the test is isolated and not affected by JVM warm up and other stuff.

CovetousSlope Over a year ago

You are right. Seems that memory allocation can be an issue. Also I am running my code multiple times so that I have correct average time.

Collectives™ on Stack Overflow

Java-Stream - Sort and Transform elements using groupingBy()

1 Answer 1

TL;DR

4 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

TL;DR

4 Comments

Your Answer

Sign up or log in

Post as a guest

Related