I'm new to Spark and don't understand how the MapReduce mechanism works with Spark. I have one CSV file containing only doubles. What I want is to compute an operation (the Euclidean distance) between the first vector and the rest of the RDD, then iterate with the other vectors. Is there another way than this one? Maybe by using the cartesian product wisely...

val rdd = sc.parallelize(Array((1, Vectors.dense(1, 2)), (2, Vectors.dense(3, 4)), ...))
val array_vects = rdd.collect
// Empty RDD of (id, distance) pairs to accumulate the results into
var rdd_rez = sc.emptyRDD[(Int, Double)]

for ((id, vector) <- array_vects) {
   // All vectors except the current one
   val rest = rdd.filter(x => x._1 != id)
   // Vectors.sqdist returns the squared Euclidean distance
   val rdd_dist = rest.map(x => (x._1, Vectors.sqdist(x._2, vector)))
   rdd_rez = rdd_rez ++ rdd_dist
}

Thank you for your support.

2 Answers

The distances (between all pairs of vectors) can be calculated using rdd.cartesian:

val rdd = sc.parallelize(Array((1,Vectors.dense(1,2)),
                               (2,Vectors.dense(3,4)),...))
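// All ordered pairs of (id, vector) entries
val product = rdd.cartesian(rdd)

// Drop self-pairs, then compute the squared distance for each remaining pair
val result = product.filter{ case ((a, b), (c, d)) => a != c }
                    .map   { case ((a, b), (c, d)) =>
                                   (a, Vectors.sqdist(b, d)) }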
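
To see what the filter-then-map over a cartesian product yields, the same shape can be tried on a local Scala collection, without Spark. This is only a sketch: `PairwiseSketch`, `points`, `sqdist`, and `distances` are illustrative names rather than Spark API, plain `Array[Double]` stands in for MLlib vectors, and the third point is made-up sample data. Unlike the snippet above, it keys each result by both ids so that every distance can be traced back to its pair.

```scala
// Local-collection analogue of rdd.cartesian(rdd); no Spark needed.
object PairwiseSketch {
  // Squared Euclidean distance between two equal-length arrays
  def sqdist(x: Array[Double], y: Array[Double]): Double =
    x.zip(y).map { case (p, q) => (p - q) * (p - q) }.sum

  // Sample (id, vector) pairs; the values are assumptions for illustration
  val points: Seq[(Int, Array[Double])] = Seq(
    (1, Array(1.0, 2.0)),
    (2, Array(3.0, 4.0)),
    (3, Array(5.0, 6.0))
  )

  // Every ordered pair of distinct ids, keyed by both ids
  val distances: Seq[((Int, Int), Double)] =
    for {
      (a, b) <- points
      (c, d) <- points
      if a != c
    } yield ((a, c), sqdist(b, d))
}
```

With n vectors this produces n * (n - 1) entries, since each unordered pair appears twice. If only one direction is needed, filtering on a < c instead of a != c would halve the output.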

I'm not sure why you are doing it that way. You can simply do the following.

val initialArray = Array((1, Vectors.dense(1, 2)), (2, Vectors.dense(3, 4)), ...)

// Take the vector (not the whole (id, vector) pair) of the first entry
val firstVector = initialArray(0)._2

val initialRdd = sc.parallelize(initialArray)

val euclideanRdd = initialRdd.map { case (i, vec) => (i, euclidean(firstVector, vec)) }

Here euclidean is a function we define ourselves; it takes two dense vectors and returns the Euclidean distance between them.
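
The answer leaves euclidean undefined, so here is one way it might look. This is a minimal sketch that assumes the vectors are available as plain `Array[Double]` values; the object name `EuclideanSketch` and the dimension check are additions for illustration, not part of the original answer.

```scala
object EuclideanSketch {
  // Euclidean distance between two equal-length vectors held as arrays
  def euclidean(x: Array[Double], y: Array[Double]): Double = {
    require(x.length == y.length, "vectors must have the same dimension")
    // Sum the squared coordinate differences, then take the square root
    math.sqrt(x.zip(y).map { case (p, q) => (p - q) * (p - q) }.sum)
  }
}
```

For MLlib vectors, the helper could be called as euclidean(a.toArray, b.toArray). Alternatively, math.sqrt(Vectors.sqdist(a, b)) should give the same quantity without a custom function, since Vectors.sqdist returns the squared distance.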

7 Comments

No: "Then iterate with the other vectors." That comes after computing the Euclidean distance of the first vector vs. all the others.
What do you mean by that? Do you mean you want the Euclidean distance of every vector from every other vector?
Well... I wrote that first, but changed it after I thought you wanted it for the first vector only. Still, if that is what you want, I think @Shyamendra has provided the same thing.
I'm not the OP. But the code the OP provides loops over all rows and computes the Euclidean distance to all other elements for each.
Well... As I said, I first wrote the answer for the cartesian product, but then someone commented saying otherwise. I thought he was the OP, so I changed the answer. If the OP wants the cartesian product, the other answer will be perfect for him.