
So I am fairly new to functional programming, Spark, and Scala, so forgive me if this is obvious... Basically, I have a list of files throughout HDFS that meet certain criteria, i.e. something like this:

    val list = List(
      "hdfs:///hive/some.db/BigAssHiveTable/partyear=2014/partmonth=06/partday=01/000140_0",
      "hdfs:///hive/some.db/BigAssHiveTable/partyear=2014/partmonth=06/partday=03/000258_0",
      "hdfs:///hive/some.db/BigAssHiveTable/partyear=2014/partmonth=06/partday=05/000270_0",
      "hdfs:///hive/some.db/BigAssHiveTable/partyear=2014/partmonth=06/partday=01/000297_0",
      "hdfs:///hive/some.db/BigAssHiveTable/partyear=2014/partmonth=06/partday=30/000300_0",
      "hdfs:///hive/some.db/BigAssHiveTable/partyear=2014/partmonth=06/partday=01/000362_0",
      "hdfs:///hive/some.db/BigAssHiveTable/partyear=2014/partmonth=06/partday=29/000365_0",
      "hdfs:///hive/some.db/BigAssHiveTable/partyear=2014/partmonth=06/partday=01/000397_0",
      "hdfs:///hive/some.db/BigAssHiveTable/partyear=2014/partmonth=06/partday=15/000436_0",
      "hdfs:///hive/some.db/BigAssHiveTable/partyear=2014/partmonth=06/partday=16/000447_0",
      "hdfs:///hive/some.db/BigAssHiveTable/partyear=2014/partmonth=06/partday=01/000529_0",
      "hdfs:///hive/some.db/BigAssHiveTable/partyear=2014/partmonth=06/partday=17/000585_0")

I now need to build an RDD from this list... My thought was to use a recursive union... basically, a function something like:

    def dostuff(line: String): org.apache.spark.rdd.RDD[String] = {
      // Read one HDFS file into an RDD of its lines
      sc.textFile(line)
    }

Then simply apply it through a map:

    val rddList = list.map(dostuff)
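(For completeness, a sketch of how the resulting per-file RDDs could then be folded into one, assuming `sc` is an existing SparkContext and `list` is the path list above — this is the union idea made concrete, not code from the original question:)

```scala
// One RDD per file path
val rdds = list.map(path => sc.textFile(path))
// Fold them into a single RDD covering every file
val combined = rdds.reduce(_ union _)
```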

1 Answer

You can read all the files into a single RDD like this:

    val sc = new SparkContext(...)
    sc.textFile("hdfs:///hive/some.db/BigAssHiveTable/partyear=2014/partmonth=06/*/*")
      .map(line => ...)

2 Comments

Brilliant! Thanks! I should have thought of that... Follow-up question, though... So I have something like this now:
What do you do if the list has arbitrary filenames, and there's no reasonable equivalent of your "hdfs:///hive/some.db/BigAssHiveTable/partyear=2014/partmonth=06/*/*" pattern?
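(One possible answer to that comment, as a sketch rather than anything from the original thread: `SparkContext.textFile` accepts a comma-separated string of paths, so an arbitrary list of filenames can be joined and passed in one call. Assumes `sc` is an existing SparkContext and `list` is the List[String] of paths from the question.)

```scala
// textFile accepts comma-separated paths, so no glob pattern is needed
val rdd = sc.textFile(list.mkString(","))
```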
