
In a Pig script I saved a table using PigStorage('|'). The corresponding Hadoop folder contains files like

part-r-00000

etc. What is the best way to load it in Spark/Scala? This table has 3 fields: Int, String, Float.

I tried:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val text = sc.hadoopFile("file", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], sc.defaultMinPartitions)

But then I would still need to split each line somehow. Is there a better way to do it?

If I were coding in Python, I would create a DataFrame indexed by the first field, whose columns are the values found in the string field and whose coefficients are the float values. But I need to use Scala to use the PCA module, and the Scala DataFrames don't seem that close to Python's.
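That pivoted layout can also be expressed with Spark's DataFrame API in Scala. A minimal sketch, assuming Spark 2.x; the column names id, name and value are hypothetical:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

val spark = SparkSession.builder.appName("pivot-sketch").getOrCreate()
import spark.implicits._

// Hypothetical rows matching the three fields: Int, String, Float
val df = Seq((1, "a", 0.5f), (1, "b", 1.5f), (2, "a", 2.0f))
  .toDF("id", "name", "value")

// One row per id, one column per distinct string value,
// with the float coefficients as cell values
val wide = df.groupBy("id").pivot("name").agg(first("value"))
wide.show()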

Thanks for the insight

1 Answer

PigStorage creates a text file without schema information, so you need to do that work yourself, something like:

sc.textFile("file") // or directory where the part files are  
val data = csv.map(line => {
   vals=line.split("|")
   (vals(0).toInt,vals(1),vals(2).toDouble)}
)
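If the end goal is a DataFrame for the PCA step, the tuple RDD converts directly. A minimal sketch, assuming Spark 2.x with a SparkSession named spark in scope and hypothetical column names:

import spark.implicits._

val df = data.toDF("id", "name", "value")
df.printSchema() // id: integer, name: string, value: double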

1 Comment

You can split in hadoopFile as well: val dataRDD = sc.hadoopFile("wc2.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], sc.defaultMinPartitions).map(pair => pair._2.toString).map(r => r.split("\\|")) (note the escaped pipe, since split takes a regex)
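To complete that variant, the split arrays still need the same type conversions as in the answer; a one-line sketch under the same assumptions:

val typed = dataRDD.map(vals => (vals(0).toInt, vals(1), vals(2).toDouble))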

