Here's one approach. First, set up the example:
val prefix = "/home/tmp/date="
val dates = Array("20140901", "20140902", "20140903", "20140904")
val datesRDD = sc.parallelize(dates, 2)
Prepending the prefix is easy:
val datesWithPrefixRDD = datesRDD.map(s => prefix + s)
datesWithPrefixRDD.foreach(println)
This produces:
/home/tmp/date=20140901
/home/tmp/date=20140903
/home/tmp/date=20140902
/home/tmp/date=20140904
(Note that the lines can come out in any order, since foreach() runs on the partitions in parallel.) But you asked for a single string. The obvious first attempt has some comma problems:
val bad = datesWithPrefixRDD.fold("")((s1, s2) => s1 + ", " + s2)
println(bad)
This produces:
, , /home/tmp/date=20140901, /home/tmp/date=20140902, , /home/tmp/date=20140903, /home/tmp/date=20140904
The problem is that Spark's RDD fold() method seeds the concatenation with the empty string I provided more than once: once for each partition, and once more when combining the partition results. But we can deal with the empty strings:
val good = datesWithPrefixRDD.fold("")((s1, s2) =>
  s1 match {
    case "" => s2
    case s => s + ", " + s2
  })
println(good)
Then we get:
/home/tmp/date=20140901, /home/tmp/date=20140902, /home/tmp/date=20140903, /home/tmp/date=20140904
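To see exactly where those extra commas came from, the fold semantics can be simulated on plain Scala collections, no Spark required. This is a sketch that assumes the RDD's two partitions split the four dates evenly, as sc.parallelize(dates, 2) would:

```scala
// Pretend the RDD's two partitions are these two lists:
val partitions = List(
  List("/home/tmp/date=20140901", "/home/tmp/date=20140902"),
  List("/home/tmp/date=20140903", "/home/tmp/date=20140904"))

// fold() seeds each partition with the zero value ("")...
val perPartition = partitions.map(_.fold("")((s1, s2) => s1 + ", " + s2))
// ...and seeds the cross-partition merge with it once more:
val merged = perPartition.fold("")((s1, s2) => s1 + ", " + s2)
println(merged)
// => ", , /home/tmp/date=20140901, /home/tmp/date=20140902, , /home/tmp/date=20140903, /home/tmp/date=20140904"
```

Three applications of the zero value, three spurious separators: exactly the bad output we saw above.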
EDIT: Actually, reduce() produces a tidier answer because there is no zero value at all, so the extra commas never appear:
val alternative = datesWithPrefixRDD.reduce((s1, s2) => s1 + ", " + s2)
println(alternative)
Again we get:
/home/tmp/date=20140901, /home/tmp/date=20140902, /home/tmp/date=20140903, /home/tmp/date=20140904
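For comparison, on a plain local collection the idiomatic join is mkString. If the RDD is small enough to fit in driver memory, one could also collect() it and join locally; here's the local sketch (the collect() line is a hypothetical variant, not run here):

```scala
// Plain Scala equivalent: build the strings locally and join with mkString.
val prefix = "/home/tmp/date="
val dates = Array("20140901", "20140902", "20140903", "20140904")
val joined = dates.map(prefix + _).mkString(", ")
println(joined)
// Hypothetical Spark variant (assumes the data is small):
// val joined = datesWithPrefixRDD.collect().mkString(", ")
```

The fold/reduce versions do the concatenation on the executors, which matters more as the data grows; mkString after collect() is just a convenience for small results.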