5

If I have an array of arrays (similar to a matrix) in Scala, what's an efficient way to sum up each column of the matrix? For example, if my array of arrays is like the one below:

val arr =  Array(Array(1, 100, ...), Array(2, 200, ...), Array(3, 300, ...))

and I want to sum up each column (e.g., sum up the first element of all sub-arrays, sum up the second element of all sub-arrays, etc.) and get a new array like below:

newArr = Array(6, 600, ...)

How can I do this efficiently in Spark Scala?

4 Answers

6

There is a suitable .transpose method on List that can help here, although I can't say what its efficiency is like:

arr.toList.transpose.map(_.sum)

(then call .toArray if you specifically need the result as an array).


6 Comments

Also works without toList: arr.transpose.map(_.sum) #=> res24: Array[Int] = Array(6, 600)
Apparently it requires all input arrays to be the same length; I get an error otherwise.
@MichaelKohl So it does, I must have made a mistake when I tried it directly on the initial array.
@Paul Yes, it specifically warns about that in the (expanded) docs for List - doesn't seem to mention it in the docs for Array, though.
The mention of Spark makes me think the array can be quite big. Transposing that probably isn't the best approach.
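Following that comment, a minimal sketch of the distributed equivalent, assuming the rows already live in an RDD[Array[Int]] and all rows have the same length (the columnSums name is illustrative, not from any answer):

import org.apache.spark.rdd.RDD

// Illustrative: sum the columns of an RDD of equal-length rows, e.g. sc.parallelize(arr)
def columnSums(rows: RDD[Array[Int]]): Array[Int] =
  rows.reduce { (a, b) =>
    // element-wise sum of two rows; addition is associative and commutative,
    // so reduce is safe even though Spark combines partitions in arbitrary order
    a.zip(b).map { case (x, y) => x + y }
  }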
4

Using breeze Vector:

scala> val arr =  Array(Array(1, 100), Array(2, 200), Array(3, 300))
arr: Array[Array[Int]] = Array(Array(1, 100), Array(2, 200), Array(3, 300))

scala> arr.map(breeze.linalg.Vector(_)).reduce(_ + _)
res0: breeze.linalg.Vector[Int] = DenseVector(6, 600)

If your input is sparse you may consider using breeze.linalg.SparseVector.
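For example, a sketch of the sparse variant, assuming SparseVector accepts a dense Array the same way Vector does above:

import breeze.linalg.SparseVector

// Assumption: SparseVector(row) builds a sparse vector from a dense Array,
// storing only the nonzero entries; + sums element-wise as with DenseVector.
val sparseColSums = arr.map(row => SparseVector(row)).reduce(_ + _)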

1 Comment

github.com/scalanlp/breeze/wiki/Breeze-Linear-Algebra has useful reference functions for this sort of thing, though it assumes you're using Scala / Breeze data types (e.g. DenseMatrix) instead of an Array of Arrays.
4

In practice, a linear-algebra vector library, as mentioned by @zero323, will often be the better choice.

If you can't use a vector library, I suggest writing a function col2sum that adds two rows element-wise (accumulating per-column sums, even if the rows are not the same length), and then using Array.reduce to extend this operation to N rows. Using reduce is valid because addition does not depend on the order of operations (e.g. 1+2+3 == 3+2+1 == 3+1+2 == 6):

def col2sum(x: Array[Int], y: Array[Int]): Array[Int] = {
    // zipAll pads the shorter row with 0, so rows of different lengths still work
    x.zipAll(y, 0, 0).map(pair => pair._1 + pair._2)
}

def colsum(a: Array[Array[Int]]): Array[Int] = {
    a.reduce(col2sum)
}

val z = Array(Array(1, 2, 3, 4, 5), Array(2, 4, 6, 8, 10), Array(1, 9));

colsum(z)

--> Array[Int] = Array(4, 15, 9, 12, 15)

3 Comments

Nicely done, especially as it can be expressed quite succinctly: arr.reduce(_.zipAll(_,0,0).map{case (a,b) => a+b})
@jwh Do you need the case statement, as there is only one case?
Well, there are a number of ways to sum a Tuple2 (Int,Int). Your code breaks the received argument, a tuple, into its _1 and _2 parts, which is good and straightforward. I chose the "partial function" method, which requires braces {} and the case statement, because I find it a little less visually cluttered. The case is needed to invoke pattern matching, which then assigns each part to the designated variables (a and b).
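For reference, the two spellings discussed in this thread, applied to the original arr:

// Tuple accessors, as in the answer's col2sum:
val viaAccessors = arr.reduce(_.zipAll(_, 0, 0).map(pair => pair._1 + pair._2))

// Pattern-matching anonymous function, as in the one-liner above:
val viaPattern = arr.reduce(_.zipAll(_, 0, 0).map { case (a, b) => a + b })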
0
scala> val arr =  Array(Array(1, 100), Array(2, 200), Array(3, 300 ))
arr: Array[Array[Int]] = Array(Array(1, 100), Array(2, 200), Array(3, 300))

scala> arr.flatten.zipWithIndex.groupBy(c => (c._2 + 1) % 2)
       .map(a => a._1 -> a._2.foldLeft(0)((sum, i) => sum + i._1))

res40: scala.collection.immutable.Map[Int,Int] = Map(1 -> 6, 0 -> 600)

Flatten the array and zipWithIndex to get each element's index, groupBy the index modulo the number of columns (2 here) to collect each column's values, then foldLeft to sum each group. Note that the result is a Map (keyed by (index + 1) % 2, so key 1 holds the first column and key 0 the second), not an ordered Array.
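If you need the result back in column order as an Array instead of a Map, a sketch along the same lines (assuming all rows have the same length, and grouping by index modulo the column count so the keys are the column indices themselves):

val nCols = arr.head.length                          // assumes equal-length rows
val byColumn = arr.flatten.zipWithIndex
  .groupBy { case (_, i) => i % nCols }              // key = column index
  .map { case (col, xs) => col -> xs.map(_._1).sum } // sum each column's values
val newArr = Array.tabulate(nCols)(byColumn)         // Array(6, 600)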

