
I have an RDD containing values like this:

[
   (Key1, ([2,1,4,3,5],5)),
   (Key2, ([6,4,3,5,2],5)),
   (Key3, ([14,12,13,10,15],5)),
]

and I need to sort the value of the array part just like this:

[
   (Key1, ([1,2,3,4,5],5)),
   (Key2, ([2,3,4,5,6],5)),
   (Key3, ([10,12,13,14,15],5)),
]

I found two sorting methods in Spark: sortBy and sortByKey. I tried the sortBy method like this:

myRDD.sortBy(lambda x: x[1][0])

But unfortunately, that sorts the records by the first element of each array instead of sorting the elements within each array.

Also, sortByKey doesn't help, since it only sorts the records by their keys.

How can I achieve the sorted RDD?

1 Answer

Try something like this:

rdd2 = rdd.map(lambda x: (x[0], (sorted(x[1][0]), x[1][1])))
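Since map applies the lambda record by record, the transformation can be sanity-checked without a Spark cluster: the sketch below runs the same lambda over a plain Python list shaped like the RDD in the question. The variable names (`data`, `fix`, `result`) are illustrative, not from the original.

```python
# Records shaped like the RDD in the question: (key, (list, count))
data = [
    ("Key1", ([2, 1, 4, 3, 5], 5)),
    ("Key2", ([6, 4, 3, 5, 2], 5)),
    ("Key3", ([14, 12, 13, 10, 15], 5)),
]

# Sort only the inner list; keep the key and the trailing count intact.
fix = lambda x: (x[0], (sorted(x[1][0]), x[1][1]))

result = list(map(fix, data))
# result[0] is ("Key1", ([1, 2, 3, 4, 5], 5))
```

With an actual RDD, the same lambda passed to `rdd.map(...)` produces the desired output, because each record's value is a `(list, count)` tuple and only `x[1][0]` needs sorting.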

4 Comments

As far as I know, sorted() is a Python function, so is this sorting process distributed? If we can use such functions, I could also use the NumPy sort function. Which one is better in terms of performance? @thebluephantom
You are not sorting in a distributed manner here. You are just sorting an array that is an element of the RDD.
So how can I sort the elements of the array in a distributed manner? @thebluephantom
It's a narrow transformation, so it is distributed by default. Believe me, please: you are not sorting the RDD, just an element of the RDD.
