I am sifting through a large data set, parsing records and grouping them by key. But to use the groupBy function I need to convert my Iterator to an Array. Why is groupBy not present on Iterator? I understand how an iterator works and that it can iterate through its elements only once. But when methods like map, filter, foreach etc. are provided on Iterator, why not provide groupBy as well?
Is there any specific reason for this? Converting an iterator to an Array takes more time when you work with large data.
1 Answer
One approach to avoid loading the entire dataset into an Array or List from an Iterator is to use foldLeft to assemble the aggregated Map. Below is an example of computing the sum of values by key via foldLeft from an Iterator:
val it = Iterator(("a", 1), ("a", 2), ("b", 3), ("b", 4), ("c", 5))
it.foldLeft(Map.empty[String, Int]){ case (m, (k, v)) =>
m + (k -> (m.getOrElse(k, 0) + v))
}
// res1: scala.collection.immutable.Map[String,Int] = Map(a -> 3, b -> 7, c -> 5)
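If you need the grouped values themselves rather than a running aggregate, the same foldLeft idea can be sketched as below. `groupValues` is a hypothetical helper name, not something in the standard library, and unlike the sum above it keeps every value in memory, just as a groupBy would:

```scala
// Hypothetical helper (not part of the Scala standard library):
// collects the values of an Iterator of pairs by key in a single pass.
def groupValues[K, V](it: Iterator[(K, V)]): Map[K, Vector[V]] =
  it.foldLeft(Map.empty[K, Vector[V]]) { case (m, (k, v)) =>
    // Append v to the values already seen for k, starting from an empty Vector.
    m.updated(k, m.getOrElse(k, Vector.empty[V]) :+ v)
  }

val grouped = groupValues(Iterator(("a", 1), ("a", 2), ("b", 3), ("b", 4), ("c", 5)))
// grouped("a") is Vector(1, 2), grouped("b") is Vector(3, 4), grouped("c") is Vector(5)
```

This avoids building the intermediate Array, but the resulting Map still holds all elements, so it only helps with the copy, not with the total memory needed.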
Re: problem with groupBy on an Iterator, here's a relevant SO link and Scala-lang link.
Array. You might do .toStream instead. Then it has a groupBy and it's still lazy. That is, it's lazy until you invoke the groupBy, which will force evaluation, which makes sense because you can't know if any particular group is complete unless you inspect the entire original collection, which would exhaust an iterator.
itr.toStream is much faster than itr.toArray (try it on an infinite Iterator) but, as I indicated previously, itr.toStream.groupBy() won't be better than itr.toArray.groupBy(), which wouldn't be any better than itr.groupBy() (if there were such a thing), because they all load the entire iterator contents into memory.
.toArray has to realize every element of the iterator and load it all into memory. .toStream doesn't.
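A small sketch of the point made in these comments, assuming a Scala version where Iterator still has toStream (on 2.13 and later it is deprecated in favor of .to(LazyList), so expect a deprecation warning there). The value names are illustrative only:

```scala
// toStream on an infinite iterator returns immediately because only the
// head is evaluated; toArray on the same iterator would never finish.
val lazyInts = Iterator.from(1).toStream
val firstThree = lazyInts.take(3).toList

// groupBy has to look at every element, so it forces the whole (finite) stream.
val byParity = Iterator(1, 2, 3, 4, 5).toStream.groupBy(_ % 2)
val odds = byParity(1).toList
val evens = byParity(0).toList
```

Here `firstThree` only realizes three elements, while `byParity` ends up holding all five, which is why the stream offers no memory advantage once groupBy is involved.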