
I am interested in using Scala because it seems like a good way to parallelize operations. I need to design a machine learning algorithm that makes use of vector multiplication (as do many). I know how to do the algorithm, but what I want to build is a sparse vector implementation backed by HashMaps. Pretty much all vectors are stored as HashMap[Int, Double], where the key is the integer index of a given double in the vector.

Using Pythonish pseudocode, <7, 6, 5, 4> ==> {1:7, 2:6, 3:5, 4:4}
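In Scala, the same vector would look something like this (using an immutable HashMap):

import scala.collection.immutable.HashMap

// <7, 6, 5, 4> stored sparsely, keyed by 1-based index
val v: HashMap[Int, Double] = HashMap(1 -> 7.0, 2 -> 6.0, 3 -> 5.0, 4 -> 4.0)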

I want to define a dot product function using fold, reduce, map, etc., but I DO NOT want to use foldLeft, reduceLeft, etc., because I would like this to be potentially parallelizable: my vectors can reach 6000+ dimensions and, for dot products, order does not matter.

I have read MANY examples of foldLeft and reduceLeft, but I have yet to find out how to use HashMap.fold or HashMap.reduce.

I understand functional programming to a decent extent, but I don't understand Scala's error messages. Here is a template of what I want, more or less.

import scala.collection.immutable.HashMap

object NGramAnalysis {
  def main(args: Array[String]) {
    val mapped = HashMap(1 -> 1.2, 5 -> 2.4)
    println(mapped.fold( /* .... What goes here .... */ ))
  }
}

Conclusion: I want a solid example of using HashMap.fold, NOT foldLeft, and the same for HashMap.reduce.

Thank you in advance. I have been struggling for a while.

2 Answers


First of all, the difference between fold and reduce is that fold takes an additional argument which is used as the initial value, while reduce takes the first element of the collection as the initial value, and throws an exception if the collection is empty. So fold is somewhat more general than reduce, and I will refer to both of these functions as fold from now on.
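To make the difference concrete, here is a minimal sketch:

val nums = Seq(1.0, 2.0, 3.0)

nums.fold(0.0)(_ + _)               // 6.0 -- starts from the initial value 0.0
nums.reduce(_ + _)                  // 6.0 -- starts from the first element

Seq.empty[Double].fold(0.0)(_ + _)  // 0.0 -- safe on an empty collection
// Seq.empty[Double].reduce(_ + _)  // throws UnsupportedOperationException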

For fold to work correctly, the elements in your collection have to form a semigroup, that is, there has to be a binary operation on them which is associative: the identity (a `op` b) `op` c == a `op` (b `op` c) should hold. Associativity is needed because fold does not specify the order in which the operation is applied, which is especially important in a parallel context. (For a parallel fold the initial value additionally has to be a neutral element of the operation, i.e., the elements should form a monoid, because the initial value may be combined into the result more than once.) This operation is used to perform the fold:

a1 `op` a2 `op` a3 `op` ... `op` an

If reduce is run in parallel, it can split the collection, reduce the first half in one thread and the second half in another, and then join the partial results using the same operation. This will work correctly only if the operation is associative.
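For example, subtraction is not associative, so reducing with it in parallel can give different answers depending on how the collection is split, while addition is safe:

val xs = (1 to 100).toList

xs.reduce(_ - _)        // deterministic: always evaluated left to right
// xs.par.reduce(_ - _) // result depends on where the splits happen -- don't do this

xs.par.reduce(_ + _)    // 5050, no matter how the collection is split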

As I have already said, the fold method takes two arguments: an initial value and an [associative] binary operator. For example, to concatenate a list of strings in parallel you could do this:

val strings = Seq("a", "b", "c", "d", ...)
strings.par.fold("")(_ ++ _)  // or strings.par.reduce(_ ++ _) if you know that strings is non-empty

So, to implement dot product you need to think of the collection you will be folding/reducing and a binary operator which will perform this reduction.

Here is a trivial implementation of the dot product for two ordered collections:

(c1 zip c2).par.map {
  case (e1, e2) => e1 * e2
}.reduce(_ + _)

That is, we zip these collections together to multiply their elements pairwise using the * operator, and then we reduce the result using the + operator. Of course, * and + have to be defined on the elements of c1 and c2.
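For example, with two ordered sequences:

val c1 = Seq(1.0, 2.0, 3.0)
val c2 = Seq(4.0, 5.0, 6.0)

(c1 zip c2).par.map {
  case (e1, e2) => e1 * e2
}.reduce(_ + _)  // 1.0*4.0 + 2.0*5.0 + 3.0*6.0 == 32.0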

However, HashMap is not ordered, so its iteration order is undefined, and there is no guarantee that zip will pair up elements with the same key. This makes the zip-based dot product above incorrect for maps. You need to do something like this instead:

c1.par.map {
  case (k, v) => v * c2.getOrElse(k, 0.0)  // treat keys missing from c2 as 0.0
}.reduce(_ + _)

Here we do not zip the collections; instead, we look up every key from the first map in the second map. Keys that appear only in the second map can safely be ignored: the corresponding values in the first map are zero, so they would contribute nothing to the sum.
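Putting this together for the HashMap[Int, Double] representation from the question, a minimal sketch could look like this (getOrElse treats absent keys as 0.0, and fold's initial value 0.0 is the identity for addition):

import scala.collection.immutable.HashMap

def dot(v1: HashMap[Int, Double], v2: HashMap[Int, Double]): Double =
  v1.par
    .map { case (k, x) => x * v2.getOrElse(k, 0.0) }  // one product term per key of v1
    .fold(0.0)(_ + _)                                 // sum the terms in parallel

val a = HashMap(1 -> 7.0, 2 -> 6.0, 3 -> 5.0, 4 -> 4.0)
val b = HashMap(1 -> 1.0, 3 -> 2.0)
dot(a, b)  // 7.0*1.0 + 5.0*2.0 == 17.0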


2 Comments

Thank you for responding so quickly. Here is some more code in case anyone is trying to figure these out too. (Note that withDefaultValue returns a new map rather than modifying the existing one, so it has to be chained.)
val mm = HashMap(1 -> 1.2, 2 -> 3.3, 5 -> 4.0).withDefaultValue(0.0); val v = mm.map(i => (i._1, mm(i._1) * 5)); val vv = v.fold((10, 10.0))((a, b) => (a._1 + b._1, a._2 + b._2)); println(v); println(vv);

I just want to add a simple example implementation, since @Vladimir Matveev covered the background.

Here the vector is backed by a HashMap. The apply factory method ensures that there is a default value of 0.0 for all unspecified indices.

The idea is pretty simple: we merge the key sets so that we have every key specified in either map, multiply the corresponding values, and sum them up. Since there is a default value for non-existent keys, this works fine.

import scala.collection.immutable.HashMap

class SparseVector private (val entries: Map[Int, Double]) {

  // toSeq matters here: mapping over a Set would collapse equal
  // products into a single element before summing
  def **(vec: SparseVector): Double =
    (entries.keySet ++ vec.entries.keySet).toSeq.par
      .map(index => entries(index) * vec.entries(index)).sum

  // Alternative suggested by @wingedsubmariner; commented out because two
  // methods with the same name and signature cannot coexist in one class.
  // aggregate never builds an intermediate collection, so it also avoids
  // the Set pitfall mentioned above:
  // def **(vec: SparseVector): Double =
  //   (entries.keySet ++ vec.entries.keySet).par
  //     .aggregate(0.0)((sum, index) => sum + entries(index) * vec.entries(index), _ + _)
}

object SparseVector {
  // ensure lookups of unspecified indices return 0.0
  def apply(entries: HashMap[Int, Double]) =
    new SparseVector(entries.withDefaultValue(0.0))
}

The methods map, sum and aggregate all have a parallel implementation.
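For example, assuming the definitions above are in scope:

val v1 = SparseVector(HashMap(1 -> 7.0, 2 -> 6.0, 3 -> 5.0, 4 -> 4.0))
val v2 = SparseVector(HashMap(1 -> 1.0, 3 -> 2.0))

v1 ** v2  // 7.0*1.0 + 5.0*2.0 == 17.0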

3 Comments

.aggregate(0.0)((sum, index) => sum + entries(index) * vec.entries(index), _ + _) might be a little more efficient because it combines the map and sum operations into a single pass.
Interesting. I have never used that function. It also puzzles me, because it is implemented in TraversableOnce like this: def aggregate[B](z: =>B)(seqop: (B, A) => B, combop: (B, B) => B): B = foldLeft(z)(seqop). The parameter combop is not even used and it's just a call to foldLeft. I guess it is designed to be used for the parallel collections.
Yeah, aggregate() is the correct thing. I was thinking of the recent Java 8 mutable reduction when I was writing my answer, and I remembered that there was an exact analogue of the Java Stream.reduce() method in Scala, but I forgot what it was called. aggregate() it is, actually; it really should be used for this task if parallelization is needed.
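As a minimal illustration of aggregate on a parallel collection (the first function folds elements within a partition, the second merges partition results):

val words = Seq("foo", "bar", "bazz").par

// total character count: seqop adds a word's length to a running sum,
// combop joins the per-partition sums
words.aggregate(0)((sum, w) => sum + w.length, _ + _)  // 10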
