0

I'm interested in apache SPARK.

I tried to ascending sort a multiple array of SPARK RDD by any column in scala.

(i.e. RDD[Array[Int] -> Array(Array(1,2,3), Array(2,3,4), Array(1,2,1))

If I sort by first column, then result will be Array(Array(1,2,3), Array(1,2,1), Array(2,3,4)). or If I sort by third column, then result will be Array(Array(1,2,3), Array(1,2,3), Array(2,3,4)). ) and then, I want to get RDD[Array[Int]] return-type value. Is there a method to solve it, whether using map() or filter() function?

2 Answers 2

2

Use RDD.sortBy:

// sorting by second column (index = 1)
val result: RDD[Array[Int]] = rdd.sortBy(_(1), ascending = true)

The sorting function can also be written using Pattern Matching:

val result: RDD[Array[Int]] = rdd.sortBy( {
  case Array(a, b, c) => b /* choose column(s) to sort by */
}, ascending = true)

Also note the ascending argument's default value is true, so you can drop it and get the same result:

val result: RDD[Array[Int]] = rdd.sortBy(_(1))
Sign up to request clarification or add additional context in comments.

2 Comments

It works. Thank you, but could you answer one more question if you don't mind? If I insert 4 or more dimension arrays, then I have to command by typing different case-statement.
Never Mind about my last comment. Thank you~~
0
val baseRdd = sc.parallelize(Array(Array(1, 2, 3), Array(2, 3, 4), Array(1, 2, 1)))

//False specifies desending order 
val result = baseRdd.sortBy(x => x(1), false)

result.foreach { x => println(x(0) + "\t" + x(1) + "\t" + x(2)) }

Result

2 3 4

1 2 3

1 2 1

6 Comments

Could you answer one more question if you don't mind? I want to make a new RDD array with some elements extracted from original RDD array. (i.e. Here is a RDD[Array[Int]] -> Array(Array(1,2,3), Array(2,3,4), Array(1,2,1). And I want to make a new RDD array from original RDD array -> Array(Array(1,2,3), Array(1,2,1)) like this.) Is there a method to solve it?
Yes you can do that. one quick question. on what basis you want to extract data from ur original rdd ? that would help me to give u precise answer.
original rdd array means that RDD[Array[Int]] -> Array(Array(1,2,3), Array(2,3,4), Array(1,2,1)). As I say, I wanna make a new rdd array (Array(Array(1,2,3), Array(1,2,1)) from RDD[Array[Int]] -> Array(Array(1,2,3), Array(2,3,4), Array(1,2,1)).` It's difficult to me...
you need to apply filter() method. filter method will return true or false. I'm not sure what are your parameters to filter it. but giving one sample ..
val baseRdd = sc.parallelize(Array(Array(1, 2, 3), Array(2, 3, 4), Array(1, 2, 1))) val resultRDD = baseRdd.filter { x => x(1).!=(3) } resultRDD.foreach { x => println(x(0) + "\t" + x(1) + "\t" + x(2)) }
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.