4

I'm practicing sorting in the Spark shell. I have an RDD with about 10 columns/variables, and I want to sort the whole RDD on the values of column 7.

rdd
org.apache.spark.rdd.RDD[Array[String]] = ...

From what I gather, the way to do that is sortByKey, which in turn only works on pairs. So I mapped the RDD so that each record is a pair consisting of column 7 (a string value) and the full original row (an array of strings):

val rdd2 = rdd.map(c => (c(7), c))
rdd2: org.apache.spark.rdd.RDD[(String, Array[String])] = ...

I then apply sortByKey, still no problem...

val rdd3 = rdd2.sortByKey()
rdd3: org.apache.spark.rdd.RDD[(String, Array[String])] = ...

But now how do I split off, collect and save the sorted original rows (Array[String]) from rdd3? Whenever I try a split on rdd3, I get an error:

val rdd4 = rdd3.map(_.split(',')(2))
<console>:33: error: value split is not a member of (String, Array[String])

What am I doing wrong here? Are there other, better ways to sort an rdd on one of its columns?

3
  • 1
    I don't understand exactly what you want. Do you mean that you want to split each String inside the Array[String]? Commented Apr 26, 2016 at 8:57
  • 2
    You tried to call split on a tuple; that's why you get the error. Commented Apr 26, 2016 at 9:02
  • 1
    @John No I want to split rdd3 (a sorted pair of column7 and the original rdd), so I'd have my original rdd back but still sorted on column 7... without actually having column 7 prefixed to it (like in rdd3). I edited the question slightly, is it clearer now? Commented Apr 26, 2016 at 9:04

4 Answers

2

What you did with rdd2 = rdd.map(c => (c(7),c)) was map each row to a tuple, which is exactly what the type rdd2: org.apache.spark.rdd.RDD[(String, Array[String])] says :). If you want to get the record back, you need to take it out of this tuple. You can map again, keeping only the second element of the tuple (the Array[String]), like so: rdd3.map(_._2)

But I would strongly suggest trying rdd.sortBy(_(7)) or something of the sort. That way you don't need to bother with tuples at all.
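As a rough illustration of the pair-then-strip approach, here is a sketch on a plain Scala Seq rather than an RDD, so it runs without Spark. The object name PairSortSketch and the sample rows are made up for the example; sortBy(_._1) stands in for sortByKey(), and the final map(_._2) is the same step you would apply to rdd3.

```scala
object PairSortSketch {
  // Hypothetical sample rows standing in for the RDD's Array[String] records.
  val rows: Seq[Array[String]] = Seq(
    Array("r1", "pear"),
    Array("r2", "apple"),
    Array("r3", "mango")
  )

  // Pair each row with the column to sort on (index 1 here),
  // analogous to rdd.map(c => (c(7), c)).
  val keyed: Seq[(String, Array[String])] = rows.map(c => (c(1), c))

  // Sort by the key; on a pair RDD this step would be sortByKey().
  val sortedPairs: Seq[(String, Array[String])] = keyed.sortBy(_._1)

  // Drop the key again by keeping only the second tuple element.
  val restored: Seq[Array[String]] = sortedPairs.map(_._2)

  def main(args: Array[String]): Unit =
    restored.foreach(r => println(r.mkString(",")))
}
```

The rows come back in key order without the key prefixed to them, which is what the question asks for.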


5 Comments

I tried your suggestion of rdd.sortBy(.=>.7), but that puts out "error: identifier expected but '=>' found". Can you edit that so I can accept your answer? As you suggest, rdd3.map(_._2) does the job as well but requires a bit more work.
Neither .sortBy(c => c._7) nor .sortBy(_._7) will work, since the elements in the RDD are Arrays, not tuples. @KoenDeCouck, I've posted my answer. You might want to check it out. :)
It didn't, however @John Titus Jungao 's answer has the solution: rdd.sortBy(_(7)). I'll accept this answer since the question focussed on the split after all and you gave some info on why that didn't work.
Yes, of course. Now I see my mistake. With @JohnTitusJungao's permission I can edit it for future reference.
@ZahiroMor, sure! go ahead. glad that helped. :)
2

If you want to sort the RDD using the 7th string in the array, you can do it directly with

rdd.sortBy(_(6)) // array starts at 0 not 1

or

rdd.sortBy(arr => arr(6))

That will save you the hassle of doing multiple transformations. The reason rdd.sortBy(_._7) or rdd.sortBy(x => x._7) won't work is that _7 is a tuple accessor, not how you access an element inside an Array. To access the 7th element of an array, say arr, you should use arr(6).

To test this, I did the following:

val rdd = sc.parallelize(Array(Array("ard", "bas", "wer"), Array("csg", "dip", "hwd"), Array("asg", "qtw", "hasd")))

// I want to sort it using the 3rd String
val sorted_rdd = rdd.sortBy(_(2))

Here's the result of sorted_rdd.collect():

Array(Array("asg", "qtw", "hasd"), Array("csg", "dip", "hwd"), Array("ard", "bas", "wer"))
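To make the difference between tuple accessors and array indexing concrete, here is a small sketch in plain Scala (no Spark needed). The object name ArrayVsTupleSketch and the sample values are invented for illustration; the last part sorts an ordinary Seq of rows with the same kind of key function that rdd.sortBy takes.

```scala
object ArrayVsTupleSketch {
  // A tuple has fixed positions, read with 1-based accessors _1, _2, ...
  val pair: (String, Array[String]) = ("key", Array("a", "b", "c"))
  val tupleFirst: String = pair._1

  // An Array is indexed with parentheses, starting at 0,
  // so the third element is arr(2).
  val arr: Array[String] = pair._2
  val arrayThird: String = arr(2)

  // Sort hypothetical rows by their third column, then read that column back.
  val rows: Seq[Array[String]] = Seq(
    Array("x", "y", "3"),
    Array("x", "y", "1"),
    Array("x", "y", "2")
  )
  val sortedThird: Seq[String] = rows.sortBy(_(2)).map(_(2))

  def main(args: Array[String]): Unit = {
    println(tupleFirst)
    println(arrayThird)
    println(sortedThird.mkString(","))
  }
}
```

Note that the sort key here is a String, so the ordering is lexicographic; that matches the question, where column 7 holds string values.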

1 Comment

Thank you John! This solution looks like the better way to sort. I'll accept Zahiro's answer however because of the way the question was phrased, appended with your solution. (Upvoted this)
1

Just do this:

val rdd4 = rdd3.map(_._2)


0

I assume you're not familiar with Scala, so the following should help you understand the structure of each record:

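// map is lazy and would print nothing here, so collect first and print on the driver
rdd3.collect().foreach { kv =>
  println(kv._1)               // the key: a String (the sort column)
  println(kv._2.mkString(",")) // the value: the original row, an Array[String]
}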
