2

I have the following function which deals with a series of search events which need to be grouped together in search flows in case they are related.

  def split(eventsIterator: Iterator[SearchFlowSearchEvent]): Iterator[SearchFlow] = {

    val sortedEventsIterator = eventsIterator.toList.sortBy(_.evTimeMillis).iterator


    val searchFlowsEvents: mutable.MutableList[mutable.MutableList[SearchFlowSearchEvent]] = mutable.MutableList()
    var currentSearchFlowEvents: mutable.MutableList[SearchFlowSearchEvent] = mutable.MutableList()
    var previousEvent: SearchFlowSearchEvent = null
    while (sortedEventsIterator.hasNext) {
      val currentEvent = sortedEventsIterator.next()

      if (isSameFlow(previousEvent, currentEvent)) {
        currentSearchFlowEvents += currentEvent
      } else {
        currentSearchFlowEvents = mutable.MutableList()
        currentSearchFlowEvents += currentEvent
        searchFlowsEvents += currentSearchFlowEvents
      }

      previousEvent = currentEvent
    }


    searchFlowsEvents
      .map(searchFlowEvents => model.SearchFlow(searchFlowEvents.toList))
      .iterator
  }

The approach of performing the grouping of the events listed above is iterative (I'm coming from the Java world).

Can anybody provide me some hints on how to achieve the same results in a functional fashion.

0

2 Answers 2

3

This is the kind of thing, you want to use tail recursion for:

        @tailrec 
        def groupEvents(
          in: Iterator[SearchFlowSearchEvent],
          out: List[List[SearchFlowSearchEvent]] = Nil
        ): List[List[SearchFlowSearchEvent]] = if (in.hasNext) {
          val next = in.next
          out match {
            case Nil => groupEvents(in, List(List(next)))
            case (head :: tail) :: rest if isSameFlow(head, next) => groupEvents(in, (next :: head :: tail) :: rest)
            case rest => groupEvents(in, List(next) :: rest)
          }
       } else out.map(_.reverse).reverse 

out contains the groups collected so far (in reverse order - see below). If it is empty, just start a new one. Otherwise look at the first element (last group), and check the first element there (last event). If flow is the same, add current event to this group, otherwise add a new group. Repeat.

In the end (if iterator is empty), reverse the lists, and create the flows.

It is common in scala to assemble lists in reverse order in cases like this. This is because appending to the end of the linked list (or looking at the last element) takes linear time, which would make the entire operation quadratic. Instead, we always prepend (constant time), and then reverse at the very end (linear).

Alternatively, you could write the same thing with foldLeft, but personally, I find a recursive implementation a bit clearer in this case, albeit a little longer (functionally, they are equivalent):

    in.foldLeft[List[List[SearchFlowSearchEvent]]](Nil) {
       case (Nil, next) => List(List(next))
       case ((head :: tail) :: rest, next) if isSameFlow(head, next) => 
          (next :: head :: tail) :: rest
       case (rest, next) => List(next) :: rest
    }.map { l => SearchFlow(l.reverse) }.reverse

UPDATE To address performance concerns, raised in the comments to the other post. I benchmarked the three solutions on a MacBook Pro, Mac OS 10.13.5, 2.9 GHz i7, 16G of RAM with scala 2.11.11 (default REPL settings).

The input was 100000 events, that get collapsed into 14551 groups. I ran each implementation about 500 times after warm up, and took the average time of all executions.

The original implementation took about 42ms per run. Recursive algorithm takes about 28ms FoldLeft was about 29ms

Simply sorting the array of events and converting it to iterator took about 20ms.

I hope this settles the argument of whether procedural approach will always yield better performance than functional. There is a way to speed this implementation up by making specific changes and tradeoffs, but simply replacing recursion with a loop or switching to using mutable containers is not an optimization.

Sign up to request clarification or add additional context in comments.

2 Comments

case ((head :: tail :: rest), next) if isSameFlow(head, next) => (next :: head :: tail) :: rest should be replaced with case ((head :: tail) :: rest, next) if isSameFlow(head, next) => (next :: head :: tail) :: rest to avoid compile issue.
i find both of your solutions very elegant. Thanks for the help on this one. I went for the foldLeft solution because it seems to me easier to grasp. (I'll still need to thoroughly document it for helping the maintainers of my code).
-1

As far as I know, there is no simple built-in solution for this in the collection library. As @Dima said, you should use recursion for this.

Note that if you care a lot about the performances, your initial solution using var and mutable collections will probably be the fastest. Mutability is fine as long as you have a good reason for it and as long as the mutation stays local to a particular method.

To make myself very clear, I am NOT encouraging you to micro-optimize it unless you have a benchmark showing that this helps the performances of your application in a non-negligible way.

7 Comments

The queue reverses everything on access. You are not avoiding any cost this way. WHATS WORTH is that queue.last is linear, so, what you are doing is O(N^2). A way to avoid cost would be to use a preallocated array instead of a list. But that is unlikely to make any significant difference in performance in practice. It's still O(N) no matter how you slice it. And no, the original implementation is NOT "better performance". There is no "good reason" for mutability here.
Regarding performance, I've seen orders of magnitude improvements using loops and vars rather than recursive calls and immutable structure. I don't recommend doing it that way all the time, and I haven't benchmarked this particular case, but it does happen, and performances might be a good reason depending on the exact case. FP is all good for 99% of cases though.
Amortized, yes. If you use the queue to do a lot of enqueue/dequeue pairs, then on average, it will be constant time. But if you just do a lot of enqueues, and then dequeue everything once, it will reverse in between. Note that the "amortized constant time" remark does not apply to .last. So, your implementation is actually quadratic, not linear. Yes, it is possible to come up with a case, where loops and vars are faster. It is wrong to suggest such optimizations prematurely though without actual problem to solve.
@Dima Yes, you're right on this, my bad. I will edit to only keep the comment about performances. I agree that immutability should be the default, however it's not wrong to suggest that this might be a concern. Read my answer again if you think I said that the mutable version was definitely better. Everytime I've ran a benchmark for this, the mutable version came out on top, which makes complete sense on the JVM. That doesn't mean I write mutable code everywhere, because I value my debugging time more than I value the few milliseconds that micro-optimizing will save the CPU.
Ok, I don't like feeding trolls too much so this will be my last message. Premature optimisation applies to optimising without benchmarking first. My answer says the mutable version will probably be faster, which is true based on my experience, whether you like it or not. Again, I'm just adding this for completeness. My comments then restate that this is rarely necessary. Your answer should be accepted and I even voted it up. Just remember that in programming like in life, there are often trade-offs, and it's not as black-and-white as you want it to be.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.