
In a Pig script I saved a table using PigStorage('|'). The corresponding Hadoop folder contains files like

part-r-00000

etc. What is the best way to load it in Spark/Scala? This table has 3 fields: Int, String, Float.

I tried:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val text = sc.hadoopFile("file", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], sc.defaultMinPartitions)

But then I would still need to split each line somehow. Is there a better way to do it?

If I were coding in Python, I would create a DataFrame indexed by the first field, whose columns are the values found in the string field and whose coefficients are the float values. But I need to use Scala to use the PCA module, and the Scala DataFrames don't seem that close to Python's.
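That pivoted layout can also be expressed with Spark's DataFrame API in Scala. A minimal sketch, assuming Spark 2.x; the column names id, name and value are hypothetical:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

val spark = SparkSession.builder.appName("pivot-sketch").getOrCreate()
import spark.implicits._

// Hypothetical rows matching the three fields: Int, String, Float
val df = Seq((1, "a", 0.5f), (1, "b", 1.5f), (2, "a", 2.0f))
  .toDF("id", "name", "value")

// One row per id, one column per distinct string value,
// with the float coefficients as cell values
val wide = df.groupBy("id").pivot("name").agg(first("value"))
wide.show()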

Thanks for the insight

1 Answer

PigStorage creates a text file without schema information, so you need to do that work yourself, something like:

sc.textFile("file") // or directory where the part files are  
val data = csv.map(line => {
   vals=line.split("|")
   (vals(0).toInt,vals(1),vals(2).toDouble)}
)
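If the end goal is a DataFrame for the PCA step, the tuple RDD converts directly. A minimal sketch, assuming Spark 2.x with a SparkSession named spark in scope and hypothetical column names:

import spark.implicits._

val df = data.toDF("id", "name", "value")
df.printSchema() // id: integer, name: string, value: double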

1 Comment

You can split in hadoopFile as well: val dataRDD = sc.hadoopFile("wc2.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], sc.defaultMinPartitions).map(pair => pair._2.toString).map(r => r.split("\\|")) (note the escaped pipe, since split takes a regex)
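To complete that variant, the split arrays still need the same type conversions as in the answer; a one-line sketch under the same assumptions:

val typed = dataRDD.map(vals => (vals(0).toInt, vals(1), vals(2).toDouble))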

