How do I split a Spark rdd Array[(String, Array[String])]?

Question

I'm practicing sorting in the Spark shell. I have an rdd with about 10 columns/variables. I want to sort the whole rdd on the values of column 7.

rdd
org.apache.spark.rdd.RDD[Array[String]] = ...

From what I gather, the way to do that is with sortByKey, which in turn only works on pairs. So I mapped the rdd so that each record becomes a pair consisting of column 7 (a string value) and the full original row (an array of strings):

val rdd2 = rdd.map(c => (c(7), c))
rdd2: org.apache.spark.rdd.RDD[(String, Array[String])] = ...

I then apply sortByKey, still no problem...

val rdd3 = rdd2.sortByKey()
rdd3: org.apache.spark.rdd.RDD[(String, Array[String])] = ...

But now how do I split off, collect and save that sorted original rdd from rdd3 (Array[String])? Whenever I try a split on rdd3 it gives me an error:

val rdd4 = rdd3.map(_.split(',')(2))
<console>:33: error: value split is not a member of (String, Array[String])

What am I doing wrong here? Are there other, better ways to sort an rdd on one of its columns?

I don't quite get what you want. Do you mean that you want to split each String inside the Array[String]?
– jtitusj, Commented Apr 26, 2016 at 8:57
@John No, I want to split rdd3 (a sorted pair of column 7 and the original rdd), so I'd have my original rdd back but still sorted on column 7, without column 7 prefixed to it (as in rdd3). I edited the question slightly; is it clearer now?
– AOm, Commented Apr 26, 2016 at 9:04

Zahiro Mor · Accepted Answer · 2016-04-26 12:13:40Z

2

What you did with rdd2 = rdd.map(c => (c(7),c)) was map each record to a tuple, which is exactly what the type rdd2: org.apache.spark.rdd.RDD[(String, Array[String])] says :). A tuple has no split method, hence the error. To get the original record back you need to take it out of the tuple: map again, keeping only the second element of the tuple (the Array[String]), like so: rdd3.map(_._2)

But I would strongly suggest trying rdd.sortBy(_(7)) or something of this sort. That way you do not need to bother with tuples at all.
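As a rough sketch of the two routes described above, the snippet below sorts a small made-up RDD on its third field, once through the pair route and once with sortBy, so the results can be compared. The sample data and the names rows, viaPairs and viaSortBy are illustrative and not from the question. It assumes Spark is on the classpath (for example in spark-shell, where SparkContext.getOrCreate simply returns the existing context).

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("sort-sketch").setMaster("local[*]"))

// Hypothetical sample data: three rows of three string fields each.
val rows = sc.parallelize(Seq(
  Array("ard", "bas", "wer"),
  Array("csg", "dip", "hwd"),
  Array("asg", "qtw", "hasd")))

// Route 1: key each row by its third field, sort by key, then drop the key.
val viaPairs = rows.map(c => (c(2), c)).sortByKey().map(_._2)

// Route 2: sort directly on the third field.
val viaSortBy = rows.sortBy(_(2))

// Arrays compare by reference, so join the fields into strings before comparing.
val a = viaPairs.collect().map(_.mkString(","))
val b = viaSortBy.collect().map(_.mkString(","))
println(a.toSeq == b.toSeq)
```

The keys here are all distinct on purpose; with duplicate keys the relative order of tied rows is not something I would rely on being the same between the two routes.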

edited Apr 26, 2016 at 12:13

answered Apr 26, 2016 at 10:03

Zahiro Mor


5 Comments

AOm Over a year ago

I tried your suggestion of rdd.sortBy(.=>.7), but that puts out "error: identifier expected but '=>' found". Can you edit that so I can accept your answer? As you suggest, rdd3.map(_._2) does the job as well but requires a bit more work.

jtitusj Over a year ago

.sortBy(c => c._7) won't work either, for the same reason .sortBy(_._7) doesn't: the elements in the rdd are Arrays, not tuples. @KoenDeCouck, I've posted my answer. You might want to check it out. :)

AOm Over a year ago

It didn't; however, @John Titus Jungao's answer has the solution: rdd.sortBy(_(7)). I'll accept this answer since the question focussed on the split after all and you gave some info on why that didn't work.

Zahiro Mor Over a year ago

Yes, of course. Now I see my mistake. With @JohnTitusJungao's permission I can edit it in for future reference.

jtitusj Over a year ago

@ZahiroMor, sure! go ahead. glad that helped. :)

jtitusj · Answer · 2016-04-26 11:45:27Z

2

If you want to sort the rdd by the 7th string in each array, you can do it directly with

rdd.sortBy(_(6)) // array starts at 0 not 1

or

rdd.sortBy(arr => arr(6))

That will save you all the hassle of doing multiple transformations. The reason why rdd.sortBy(_._7) or rdd.sortBy(x => x._7) won't work is because that's not how you access an element inside an Array. To access the 7th element of an array, say arr, you should do arr(6).

To test this, I did the following:

val rdd = sc.parallelize(Array(Array("ard", "bas", "wer"), Array("csg", "dip", "hwd"), Array("asg", "qtw", "hasd")))

// I want to sort it using the 3rd String
val sorted_rdd = rdd.sortBy(_(2))

Here's the result (from sorted_rdd.collect()):

Array(Array("asg", "qtw", "hasd"), Array("csg", "dip", "hwd"), Array("ard", "bas", "wer"))
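To make the difference between tuple access and array access concrete, here is a small plain-Scala sketch (no Spark needed). The values and the names pair, key, fields and third are made up for illustration.

```scala
// A tuple has fixed positions, accessed with the 1-based members _1, _2, ...
val pair: (String, Array[String]) = ("hwd", Array("csg", "dip", "hwd"))
val key = pair._1     // the String part of the tuple
val fields = pair._2  // the Array[String] part of the tuple

// An array is indexed with apply, which is 0-based: fields(0) is the first element.
val third = fields(2)

// fields._3 would not compile: Array has no member _3.
println(s"key = $key, third = $third")
```

So _._7 only makes sense when each element is a tuple with at least seven positions, while _(6) is what you want when each element is an Array.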

edited Apr 26, 2016 at 11:45

answered Apr 26, 2016 at 10:36

jtitusj


1 Comment

AOm Over a year ago

Thank you John! This solution looks like the better way to sort. I'll accept Zahiro's answer however because of the way the question was phrased, appended with your solution. (Upvoted this)

Hlib · Answer · 2016-04-26 09:23:57Z

1

Just do this:

val rdd4 = rdd3.map(_._2)
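Since the question also asks about collecting and saving, here is a minimal sketch of what could follow that line. It assumes Spark is on the classpath; the sample rows stand in for the real rdd3, and the output path in the last comment is hypothetical. values is the pair-RDD shorthand for map(_._2).

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("values-sketch").setMaster("local[*]"))

// Hypothetical stand-in for rdd3: (key, original row) pairs sorted by key.
val rdd3 = sc.parallelize(Seq(
  Array("ard", "bas", "wer"),
  Array("csg", "dip", "hwd"))).map(c => (c(2), c)).sortByKey()

// values is shorthand for map(_._2) on a pair RDD.
val rdd4 = rdd3.values

// Join each row back into one comma-separated line before printing or saving;
// an Array[String] written as-is would show up as an object reference.
val lines = rdd4.map(_.mkString(","))
val collected = lines.collect()
collected.foreach(println)

// To write the result out instead (hypothetical path; it must not exist yet):
// lines.saveAsTextFile("/tmp/sorted-by-column")
```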

answered Apr 26, 2016 at 9:23

Hlib


Peerapat A · Answer · 2016-04-26 09:09:51Z

0

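I think you may not be familiar with Scala yet, so the snippet below should help you understand what each element of rdd3 contains:

// collect() brings the records to the driver so the output shows up in the shell
rdd3.collect().foreach { kv =>
  println(kv._1)                // the key: a String (column 7)
  println(kv._2.mkString(","))  // the value: the original Array[String]
}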

answered Apr 26, 2016 at 9:09

Peerapat A
