1

Suppose I have the following domain class:

public class OrderRow {
    private Long orderId;
    private String action;
    private LocalDateTime timestamp;
    
    // getters, constructor, etc.
}

I have a following data set of OrderRows :

OrderId     Action                  Timestamp
3           Pay money               2015-05-27 12:48:47.000
3           Select Item             2015-05-27 12:44:47.000
1           Generate Payment        2015-05-27 12:55:47.000
2           Pay money               2015-05-27 12:48:47.000
2           Select Item             2015-05-27 12:44:47.000
2           Deliver                 2015-05-27 12:55:47.000
1           Generate Invoice        2015-05-27 12:48:47.000
1           Create PO               2015-05-27 12:44:47.000
3           Deliver                 2015-05-27 12:55:47.000

What I want to obtain the following Map from the sample data shown above:

[3] -> ["Select Item", "Pay money", "Deliver"]
[1] -> ["Create PO", "Generate Invoice", "Generate Payment"]
[2] -> ["Select Item", "Pay money", "Deliver"]  

By performing below operations :

  1. I want to groupBy orderId.
  2. Sort actions by timestamp.
  3. Create a Set (as there can be duplicates) of actions.

I am trying to do this in a single groupingBy operation as performing separate sorting, mapping operations take a lot of time if data set is huge.

I've tried to do the following:

orderRows.stream()
    .collect(Collectors.groupingBy(OrderRow::getOrderId, 
        Collectors.mapping(Function.identity(),
            Collectors.toCollection(
                () -> new TreeSet<>(Comparator.comparing(e -> e.timestamp))
    ))));

But then I get output as Map<String, Set<OrderRow>> where as I need the result of type Map<String, Set<String>>.

Would be really grateful if someone can show me at least a direction to go.

Note that is a critical operation and should be done in few milliseconds, hence performance is important.

1 Answer 1

1

TL;DR

  • LinkedHashSet - is the only implementation of the Set interface, which would to retain the order of the plain Strings (action property), that differs from the alphabetical one.

  • Timsort has a slightly better performance than sorting via Red-black tree (i.e. by storing elments into a TreeSet). Since OP has said that "is a critical operation", that worth to take into consideration.


You can sort the stream elements by timestamp before applying collect. That would guarantee that actions in the list would be consumed in the required order.

And to retain the order of these Strings, you can use a LinkedHashSet (TreeSet would not be helpful in this case since you need to keep the order by property which would not be present in the collection).

var actionsById = orderRows.stream()
    .sorted(Comparator.comparing(OrderRow::getTimestamp))
    .collect(Collectors.groupingBy(
        OrderRow::getOrderId,
        Collectors.mapping(
            OrderRow::getAction,
            Collectors.toCollection(LinkedHashSet::new))
    ));

Alternatively, you can sort the sets while grouping (as you were trying to do).

var actionsById = orderRows.stream()
    .collect(Collectors.groupingBy(
        OrderRow::getOrderId,
        Collectors.collectingAndThen(
            Collectors.mapping(
                Function.identity(),
                Collectors.toCollection(() -> new TreeSet<>(Comparator.comparing(OrderRow::getTimestamp)))
            ),
            set -> set.stream().map(OrderRow::getAction).collect(Collectors.toCollection(LinkedHashSet::new))
        )
    ));

Or instead of sorting via Red-Black Tree (TreeSet is backed by the an implementation of this data structure) each set can be sorted via Timsort (which is an algorithm used while sorting an array of elements, and sorted() operation is invoked in the stream, its elements are being dumped into an array).

Maintaining a Red-Black Tree is constful, and theoretically Timsort should perform a bit better.

var actionsById = orderRows.stream()
    .collect(Collectors.groupingBy(
        OrderRow::getOrderId,
        Collectors.collectingAndThen(
            Collectors.mapping(
                Function.identity(), Collectors.toList()
            ),
            list -> list.stream().sorted(Comparator.comparing(OrderRow::getTimestamp))
                .map(OrderRow::getAction)
                .collect(Collectors.toCollection(LinkedHashSet::new))
        )
    ));

To eliminate allocating a new array in memory (which happens when sorted() operation is being called) we can sort the lists produced by mapping() directly, and then create a stream out of sorted lists in the finisher function of collectingAndThen().

That would require writing a multiline lambda, but in this case is justifiable by the performance gain.

var actionsById = orderRows.stream()
    .collect(Collectors.groupingBy(
        OrderRow::getOrderId,
        Collectors.collectingAndThen(
            Collectors.mapping(
                Function.identity(), Collectors.toList()
            ),
            list -> {
                list.sort(Comparator.comparing(OrderRow::getTimestamp));
                return list.stream().map(OrderRow::getAction)
                    .collect(Collectors.toCollection(LinkedHashSet::new));
            }
        )
    ));
Sign up to request clarification or add additional context in comments.

4 Comments

I did this exactly, but the time required for doing individual sorting before groupingBy is almost 6 times. This is a critical operation and should be done in few milliseconds. That is why I want to club everything together.
yes this is solving my problem. Only one thing here is , user of sorted() is adding some performance hit as compared to collectingAndThen( mapping(Function.identity(), toCollection( () -> new TreeSet<>(Comparator.comparing(e -> e.timestamp)) )), set -> set.stream().map(OrderRow::getEventName) .collect(toCollection(LinkedHashSet::new)) )))
@CovetousSlope There's only one reason I can see why Red-black tree perform better, namely sorted() requires allocating a new array in memory, which has a cost. And it that can be eliminated by sorting an ArrayList directly and then creating a stream out of it (I'll update the code in a few minutes). That said, to measure the performance correctly, you need to make sure that the test is isolated and not affected by JVM warm up and other stuff.
You are right. Seems that memory allocation can be an issue. Also I am running my code multiple times so that I have correct average time.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.