
I am interested in using Scala because it seems like a good way to parallelize operations. I need to design a machine learning algorithm that makes use of vector multiplication (as do many). I know how to do the algorithm, but what I want to build is a sparse vector implementation backed by HashMaps. Pretty much all vectors are stored as HashMap[Int, Double], where the key is the integer index of a given double in the vector.

Using Pythonish pseudocode, <7, 6, 5, 4> ==> {1:7, 2:6, 3:5, 4:4}
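In Scala, the same vector would look something like this (using an immutable HashMap):

import scala.collection.immutable.HashMap

// <7, 6, 5, 4> stored sparsely, keyed by 1-based index
val v: HashMap[Int, Double] = HashMap(1 -> 7.0, 2 -> 6.0, 3 -> 5.0, 4 -> 4.0)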

I want to define a dot product function using fold, reduce, map, etc., but I DO NOT want to use foldLeft, reduceLeft, etc., because I would like this to be potentially parallelizable: my vectors can reach 6000+ dimensions and, for dot products, order does not matter.

I have read MANY examples of foldLeft and reduceLeft, but I have yet to find out how to use HashMap.fold or HashMap.reduce.

I understand functional programming to a decent extent, but I don't understand Scala's error messages. Here is a template of what I want, more or less.

import scala.collection.immutable.HashMap

object NGramAnalysis {
  def main(args: Array[String]) {
    val mapped = HashMap(1 -> 1.2, 5 -> 2.4)
    println(mapped.fold( /* .... What goes here .... */ ))
  }
}

Conclusion: I want a solid example of using HashMap.fold, NOT foldLeft, and the same for HashMap.reduce.

Thank you in advance. I have been struggling for a while.

2 Answers


First of all, the difference between fold and reduce is that fold takes an additional argument which is used as the initial value, while reduce takes the first element of the collection as the initial value, and throws an exception if the collection is empty. So fold is somewhat more general than reduce, and I will refer to both of these functions as fold from now on.
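To make the difference concrete, here is a minimal sketch:

val nums = Seq(1.0, 2.0, 3.0)

nums.fold(0.0)(_ + _)               // 6.0 -- starts from the initial value 0.0
nums.reduce(_ + _)                  // 6.0 -- starts from the first element

Seq.empty[Double].fold(0.0)(_ + _)  // 0.0 -- safe on an empty collection
// Seq.empty[Double].reduce(_ + _)  // throws UnsupportedOperationException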

For fold to work correctly, the elements in your collection have to form a semigroup, that is, there has to be a binary operation on them which is associative: the identity (a `op` b) `op` c == a `op` (b `op` c) should hold. Associativity is needed because fold does not specify the order in which the operation is applied, which is especially important in a parallel context. (For a parallel fold the initial value additionally has to be a neutral element of the operation, i.e., the elements should form a monoid, because the initial value may be combined into the result more than once.) This operation is used to perform the fold:

a1 `op` a2 `op` a3 `op` ... `op` an

If reduce is run in parallel, it can split the collection, reduce the first half in one thread and the second half in another, and then join the partial results using the same operation. This will work correctly only if the operation is associative.
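For example, subtraction is not associative, so reducing with it in parallel can give different answers depending on how the collection is split, while addition is safe:

val xs = (1 to 100).toList

xs.reduce(_ - _)        // deterministic: always evaluated left to right
// xs.par.reduce(_ - _) // result depends on where the splits happen -- don't do this

xs.par.reduce(_ + _)    // 5050, no matter how the collection is split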

As I have already said, the fold method takes two arguments: an initial value and an [associative] binary operator. For example, to concatenate a list of strings in parallel you could do this:

val strings = Seq("a", "b", "c", "d", ...)
strings.par.fold("")(_ ++ _)  // or strings.par.reduce(_ ++ _) if you know that strings is non-empty

So, to implement dot product you need to think of the collection you will be folding/reducing and a binary operator which will perform this reduction.

Here is a trivial implementation of the dot product for two ordered collections:

(c1 zip c2).par.map {
  case (e1, e2) => e1 * e2
}.reduce(_ + _)

That is, we zip these collections together to multiply their elements pairwise using the * operator, and then we reduce the result using the + operator. Of course, * and + have to be defined on the elements of c1 and c2.
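For example, with two ordered sequences:

val c1 = Seq(1.0, 2.0, 3.0)
val c2 = Seq(4.0, 5.0, 6.0)

(c1 zip c2).par.map {
  case (e1, e2) => e1 * e2
}.reduce(_ + _)  // 1.0*4.0 + 2.0*5.0 + 3.0*6.0 == 32.0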

However, HashMap is not ordered, so its iteration order is undefined, and there is no guarantee that zip will pair up elements with the same key. This makes the zip-based dot product above incorrect for maps. You need to do something like this instead:

c1.par.map {
  case (k, v) => v * c2.getOrElse(k, 0.0)  // treat keys missing from c2 as 0.0
}.reduce(_ + _)

Here we do not zip the collections; instead, we look up every key from the first map in the second map. Keys that appear only in the second map can safely be ignored: the corresponding values in the first map are zero, so they would contribute nothing to the sum.
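Putting this together for the HashMap[Int, Double] representation from the question, a minimal sketch could look like this (getOrElse treats absent keys as 0.0, and fold's initial value 0.0 is the identity for addition):

import scala.collection.immutable.HashMap

def dot(v1: HashMap[Int, Double], v2: HashMap[Int, Double]): Double =
  v1.par
    .map { case (k, x) => x * v2.getOrElse(k, 0.0) }  // one product term per key of v1
    .fold(0.0)(_ + _)                                 // sum the terms in parallel

val a = HashMap(1 -> 7.0, 2 -> 6.0, 3 -> 5.0, 4 -> 4.0)
val b = HashMap(1 -> 1.0, 3 -> 2.0)
dot(a, b)  // 7.0*1.0 + 5.0*2.0 == 17.0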


2 Comments

Thank you for responding so quickly. Here is some more code in case anyone is trying to figure these out too. (Note that withDefaultValue returns a new map rather than modifying the existing one, so it has to be chained.)
val mm = HashMap(1 -> 1.2, 2 -> 3.3, 5 -> 4.0).withDefaultValue(0.0); val v = mm.map(i => (i._1, mm(i._1) * 5)); val vv = v.fold((10, 10.0))((a, b) => (a._1 + b._1, a._2 + b._2)); println(v); println(vv);

I just want to add a simple example implementation, since @Vladimir Matveev covered the background.

Here the vector is backed by a HashMap. The apply factory method ensures that there is a default value of 0.0 for all unspecified indices.

The idea is pretty simple: we merge the key sets so that we have every key specified in either map, multiply the corresponding values, and sum them up. Since there is a default value for non-existent keys, this works fine.

import scala.collection.immutable.HashMap

class SparseVector private (val entries: Map[Int, Double]) {

  // toSeq matters here: mapping over a Set would collapse equal
  // products into a single element before summing
  def **(vec: SparseVector): Double =
    (entries.keySet ++ vec.entries.keySet).toSeq.par
      .map(index => entries(index) * vec.entries(index)).sum

  // Alternative suggested by @wingedsubmariner; commented out because two
  // methods with the same name and signature cannot coexist in one class.
  // aggregate never builds an intermediate collection, so it also avoids
  // the Set pitfall mentioned above:
  // def **(vec: SparseVector): Double =
  //   (entries.keySet ++ vec.entries.keySet).par
  //     .aggregate(0.0)((sum, index) => sum + entries(index) * vec.entries(index), _ + _)
}

object SparseVector {
  // ensure lookups of unspecified indices return 0.0
  def apply(entries: HashMap[Int, Double]) =
    new SparseVector(entries.withDefaultValue(0.0))
}

The methods map, sum and aggregate all have a parallel implementation.
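For example, assuming the definitions above are in scope:

val v1 = SparseVector(HashMap(1 -> 7.0, 2 -> 6.0, 3 -> 5.0, 4 -> 4.0))
val v2 = SparseVector(HashMap(1 -> 1.0, 3 -> 2.0))

v1 ** v2  // 7.0*1.0 + 5.0*2.0 == 17.0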

3 Comments

.aggregate(0.0)((sum, index) => sum + entries(index) * vec.entries(index), _ + _) might be a little more efficient because it combines the map and sum operations into a single pass.
Interesting. I have never used that function. It also puzzles me, because it is implemented in TraversableOnce like this: def aggregate[B](z: =>B)(seqop: (B, A) => B, combop: (B, B) => B): B = foldLeft(z)(seqop). The parameter combop is not even used and it's just a call to foldLeft. I guess it is designed to be used for the parallel collections.
Yeah, aggregate() is the correct thing. I was thinking of the recent Java 8 mutable reduction when I was writing my answer, and I remembered that there was an exact analogue of the Java Stream.reduce() method in Scala, but I forgot what it was called. aggregate() it is, actually; it really should be used for this task if parallelization is needed.
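As a minimal illustration of aggregate on a parallel collection (the first function folds elements within a partition, the second merges partition results):

val words = Seq("foo", "bar", "bazz").par

// total character count: seqop adds a word's length to a running sum,
// combop joins the per-partition sums
words.aggregate(0)((sum, w) => sum + w.length, _ + _)  // 10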
