How to avoid for loop with Spark?

Question

i'm new to spark and don't understand how mapreduce mechanism works with spark. I have one csv file with only doubles, what i want is to make an operation (compute euclidian distance) with the first vector with the rest of the rdd. Then iterate with the other vectors. It is exist a other way than this one ? Maybe use wisely the cartesian product...

val rdd = sc.parallelize(Array((1,Vectors.dense(1,2)),(2,Vectors.dense(3,4),...)))
val array_vects = rdd.collect
val size = rdd.count
val emptyArray = Array((0,Vectors.dense(0))).tail
var rdd_rez = sc.parallelize(emptyArray)

for( ind <- 0 to size -1 ) {
   val vector = array_vects(ind)._2
   val rest = rdd.filter(x => x._1 != ind)
   val rdd_dist = rest.map( x => (x._1 , Vectors.sqdist(x._2,vector)))
   rdd_rez = rdd_rez ++ rdd_dist
}

Thank you for your support.

Shyamendra Solanki · Accepted Answer · 2015-04-17 13:12:08Z

4

The distances (between all pairs of vectors) can be calculated using rdd.cartesian:

val rdd = sc.parallelize(Array((1,Vectors.dense(1,2)),
                               (2,Vectors.dense(3,4)),...))
val product = rdd.cartesian(rdd)

val result = product.filter{ case ((a, b), (c, d)) => a != c }
                    .map   { case ((a, b), (c, d)) => 
                                   (a, Vectors.sqdist(b, d)) }

answered Apr 17, 2015 at 13:12

Shyamendra Solanki

8,8512 gold badges33 silver badges25 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

sarveshseri · Accepted Answer · 2015-04-17 12:44:59Z

0

I don't think why you were trying to do something like that. you can simply do this as follows.

val initialArray = Array( ( 1,Vectors.dense(1,2) ), ( 2,Vectors.dense(3,4) ),... )

val firstVector = initialArray( 0 )

val initialRdd = sc.parallelize( initialArray )

val euclideanRdd = initialRdd.map( { case ( i, vec ) => ( i, euclidean( firstVector, vec ) ) } )

Where we define a function euclidean which take two dense vectors and returns euclidean distances.

edited Apr 17, 2015 at 12:44

answered Apr 17, 2015 at 12:02

sarveshseri

14k30 silver badges50 bronze badges

7 Comments

The Archetypal Paul Over a year ago

No, "!hen iterate with the other vectors.",. That's after doing the euclidean distance witth first vector vs. all the others)

sarveshseri Over a year ago

What do yo mean by that.... do you mean you want the euclidean distance of every vector from every other vector ?

sarveshseri Over a year ago

Well... I wrote that first... but changed after I thought that you want it for first vector only. But still... if you want that... I think @Shyamendra has provided the same thing.

The Archetypal Paul Over a year ago

I'm not the OP. But the code the OP provides loops over all rows and computes the euclidiena with all other elements for each.

sarveshseri Over a year ago

Well... As I said I first wrote the answer for cartesian... But then someone commented saying otherwise... I thought he is Op... Changed the answer. And if the Op wants cartesian then the other answer will be perfect for him.

|

Collectives™ on Stack Overflow

How to avoid for loop with Spark?

2 Answers 2

Comments

7 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

7 Comments

Your Answer

Sign up or log in

Post as a guest

Related