I am sifting through a large data set, parsing records and grouping them by key. To use groupBy, however, I first have to convert my Iterator to an Array. Why is groupBy not defined on Iterator? I understand how an iterator works and that it can traverse its elements only once. But if Iterator provides methods like map, filter, and foreach, why not groupBy as well?
Is there a specific reason for this? Converting an iterator to an Array is costly when working with large data.

  • You don't have to convert to Array. You might do .toStream instead. Then it has a groupBy and it's still lazy. That is, it's lazy until you invoke the groupBy, which will force evaluation, which makes sense because you can't know if any particular group is complete unless you inspect the entire original collection, which would exhaust an iterator. Commented Oct 13, 2018 at 5:07
  • @jwvh : toStream() does not make any difference. It has the same performance as with array. Commented Oct 13, 2018 at 7:06
  • Yes and no. itr.toStream is much faster than itr.toArray (try it on an infinite Iterator) but, as I indicated previously, itr.toStream.groupBy() won't be better than itr.toArray.groupBy(), which wouldn't be any better than itr.groupBy() (if there were such a thing), because they all load the entire iterator contents into memory. Commented Oct 13, 2018 at 7:49
  • @jwvh : Why do you say that itr.toStream is much faster than itr.toArray? Commented Oct 13, 2018 at 10:34
  • Because .toArray has to realize every element of the iterator and load it all into memory. .toStream doesn't. Commented Oct 13, 2018 at 15:27
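
The point made in the comments above can be checked by counting how many elements are actually pulled from the source iterator. This is a minimal sketch: `LazinessDemo` and its method names are hypothetical, introduced only for illustration. On Scala 2.13+ `toStream` is deprecated in favour of `LazyList`, so expect a deprecation warning there.

```scala
object LazinessDemo {
  // Builds an iterator over 0 until total, runs `consume` on it, and returns
  // how many elements were actually pulled from the iterator.
  private def countPulled(total: Int)(consume: Iterator[Int] => Any): Int = {
    var pulled = 0
    val it = Iterator.range(0, total).map { x => pulled += 1; x }
    consume(it)
    pulled
  }

  // toStream followed by reading only the first n elements
  def pulledByStreamTake(total: Int, n: Int): Int =
    countPulled(total)(_.toStream.take(n).toList)

  // toArray realizes every element up front
  def pulledByArray(total: Int): Int =
    countPulled(total)(_.toArray)

  // groupBy on the Stream forces the whole source, just like toArray
  def pulledByStreamGroupBy(total: Int): Int =
    countPulled(total)(_.toStream.groupBy(_ % 2))
}
```

With a source of 1000 elements, `pulledByStreamTake(1000, 3)` should report only a handful of elements pulled, while `pulledByArray(1000)` and `pulledByStreamGroupBy(1000)` both report the full 1000. That matches the comment thread: the conversion to a Stream is cheap, but grouping cannot avoid consuming everything.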

1 Answer

One approach to avoid loading the entire dataset into an Array or List from an Iterator is to use foldLeft to assemble the aggregated Map. Below is an example of computing the sum of values by key via foldLeft from an Iterator:

val it = Iterator(("a", 1), ("a", 2), ("b", 3), ("b", 4), ("c", 5))

it.foldLeft(Map.empty[String, Int]){ case (m, (k, v)) =>
  m + (k -> (m.getOrElse(k, 0) + v))
}
// res1: scala.collection.immutable.Map[String,Int] = Map(a -> 3, b -> 7, c -> 5)
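
If the groups themselves are needed rather than a per-key aggregate, the same single-pass idea can collect the values for each key. The following is a minimal sketch under that assumption; `IteratorGrouping` and `groupByKey` are hypothetical names, not part of the standard library. It uses a mutable map of builders internally to avoid rebuilding an immutable Map on every element.

```scala
import scala.collection.mutable

object IteratorGrouping {
  // Consumes the iterator once and collects the values seen for each key,
  // preserving the order in which they were encountered.
  def groupByKey[K, V](it: Iterator[(K, V)]): Map[K, Vector[V]] = {
    val acc = mutable.Map.empty[K, mutable.Builder[V, Vector[V]]]
    it.foreach { case (k, v) =>
      // Create the builder for a key the first time it is seen
      acc.getOrElseUpdate(k, Vector.newBuilder[V]) += v
    }
    // Freeze the builders into an immutable result
    acc.iterator.map { case (k, b) => k -> b.result() }.toMap
  }
}
```

For example, `IteratorGrouping.groupByKey(Iterator(("a", 1), ("a", 2), ("b", 3)))` yields a map with `a -> Vector(1, 2)` and `b -> Vector(3)`. Note that the result still holds every element in memory. That cost is inherent to grouping, so when only a sum or count per key is needed, the foldLeft aggregation above is the lighter option.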

Re: problem with groupBy on an Iterator, here's a relevant SO link and Scala-lang link.
