
I have an RDD containing values like this:

[
   (Key1, ([2,1,4,3,5],5)),
   (Key2, ([6,4,3,5,2],5)),
   (Key3, ([14,12,13,10,15],5)),
]

and I need to sort the value of the array part just like this:

[
   (Key1, ([1,2,3,4,5],5)),
   (Key2, ([2,3,4,5,6],5)),
   (Key3, ([10,12,13,14,15],5)),
]

I found two sorting methods in Spark: sortBy and sortByKey. I tried the sortBy method like this:

myRDD.sortBy(lambda x: x[1][0])

But unfortunately, that sorts the records by the first element of each array instead of sorting the elements within each array.

Also, sortByKey doesn't help, since it only sorts the records by their keys.

How can I achieve the sorted RDD?

1 Answer

Try something like this:

rdd2 = rdd.map(lambda x: (x[0], (sorted(x[1][0]), x[1][1])))
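Since map applies the lambda record by record, the transformation can be sanity-checked without a Spark cluster: the sketch below runs the same lambda over a plain Python list shaped like the RDD in the question. The variable names (`data`, `fix`, `result`) are illustrative, not from the original.

```python
# Records shaped like the RDD in the question: (key, (list, count))
data = [
    ("Key1", ([2, 1, 4, 3, 5], 5)),
    ("Key2", ([6, 4, 3, 5, 2], 5)),
    ("Key3", ([14, 12, 13, 10, 15], 5)),
]

# Sort only the inner list; keep the key and the trailing count intact.
fix = lambda x: (x[0], (sorted(x[1][0]), x[1][1]))

result = list(map(fix, data))
# result[0] is ("Key1", ([1, 2, 3, 4, 5], 5))
```

With an actual RDD, the same lambda passed to `rdd.map(...)` produces the desired output, because each record's value is a `(list, count)` tuple and only `x[1][0]` needs sorting.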

4 Comments

As far as I know, sorted() is a Python function, so is this sorting process distributed? If we can use such functions, I could also use the NumPy sort function. Which one is better in terms of performance? @thebluephantom
You are not sorting in a distributed manner here. You are just sorting an array that is an element of the RDD.
So how can I sort the elements of the array in a distributed manner? @thebluephantom
It's a narrow transformation, so it is distributed by default. Believe me, please: you are not sorting the RDD, just an element of the RDD.
