
So I am fairly new to functional programming, Spark, and Scala, so forgive me if this is obvious... Basically, I have a list of files throughout HDFS that meet certain criteria, i.e. something like this:

    val list = List(
      "hdfs:///hive/some.db/BigAssHiveTable/partyear=2014/partmonth=06/partday=01/000140_0",
      "hdfs:///hive/some.db/BigAssHiveTable/partyear=2014/partmonth=06/partday=03/000258_0",
      "hdfs:///hive/some.db/BigAssHiveTable/partyear=2014/partmonth=06/partday=05/000270_0",
      "hdfs:///hive/some.db/BigAssHiveTable/partyear=2014/partmonth=06/partday=01/000297_0",
      "hdfs:///hive/some.db/BigAssHiveTable/partyear=2014/partmonth=06/partday=30/000300_0",
      "hdfs:///hive/some.db/BigAssHiveTable/partyear=2014/partmonth=06/partday=01/000362_0",
      "hdfs:///hive/some.db/BigAssHiveTable/partyear=2014/partmonth=06/partday=29/000365_0",
      "hdfs:///hive/some.db/BigAssHiveTable/partyear=2014/partmonth=06/partday=01/000397_0",
      "hdfs:///hive/some.db/BigAssHiveTable/partyear=2014/partmonth=06/partday=15/000436_0",
      "hdfs:///hive/some.db/BigAssHiveTable/partyear=2014/partmonth=06/partday=16/000447_0",
      "hdfs:///hive/some.db/BigAssHiveTable/partyear=2014/partmonth=06/partday=01/000529_0",
      "hdfs:///hive/some.db/BigAssHiveTable/partyear=2014/partmonth=06/partday=17/000585_0")

I now need to build an RDD from this list... My thought was to use a recursive union... basically, a function something like:

    def dostuff(line: String): org.apache.spark.rdd.RDD[String] = {
      // Read one HDFS file into an RDD of its lines
      sc.textFile(line)
    }

Then simply apply it through a map:

    val rddList = list.map(dostuff)
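(For completeness, a sketch of how the resulting per-file RDDs could then be folded into one, assuming `sc` is an existing SparkContext and `list` is the path list above — this is the union idea made concrete, not code from the original question:)

```scala
// One RDD per file path
val rdds = list.map(path => sc.textFile(path))
// Fold them into a single RDD covering every file
val combined = rdds.reduce(_ union _)
```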

1 Answer

You can read all the files into a single RDD like this:

    val sc = new SparkContext(...)
    sc.textFile("hdfs:///hive/some.db/BigAssHiveTable/partyear=2014/partmonth=06/*/*")
      .map(line => ...)

2 Comments

Brilliant! Thanks! I should have thought of that... Follow-up question, though... So I have something like this now:
What do you do if the list has arbitrary filenames, and there's no reasonable equivalent of your "hdfs:///hive/some.db/BigAssHiveTable/partyear=2014/partmonth=06/*/*" pattern?
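(One possible answer to that comment, as a sketch rather than anything from the original thread: `SparkContext.textFile` accepts a comma-separated string of paths, so an arbitrary list of filenames can be joined and passed in one call. Assumes `sc` is an existing SparkContext and `list` is the List[String] of paths from the question.)

```scala
// textFile accepts comma-separated paths, so no glob pattern is needed
val rdd = sc.textFile(list.mkString(","))
```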
