I'm trying to read a CSV file with Java and Spark. Currently I do this:
String master = "local[2]";
String csvInput = "/home/username/Downloads/countrylist.csv";
String csvOutput = "/home/username/Downloads/countrylist";

JavaSparkContext sc = new JavaSparkContext(master, "loadwholecsv",
        System.getenv("SPARK_HOME"), System.getenv("JARS"));
JavaRDD<String> csvData = sc.textFile(csvInput, 1);
JavaRDD<List<String>> lines = csvData.map(new Function<String, List<String>>() {
    @Override
    public List<String> call(String s) {
        return new ArrayList<String>(Arrays.asList(s.split("\\s*,\\s*")));
    }
});
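As a side note, the `\s*,\s*` pattern splits on commas and strips the surrounding whitespace in one step, so no extra trim() is needed. A quick plain-Java check of that regex (outside Spark):

```java
import java.util.Arrays;
import java.util.List;

public class SplitDemo {
    public static void main(String[] args) {
        // Same regex as in the map() above: split on commas, trimming spaces
        List<String> fields = Arrays.asList("one, two,   three".split("\\s*,\\s*"));
        System.out.println(fields); // prints [one, two, three]
    }
}
```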
So I have all the lines of the CSV file as rows in my RDD. I also wrote this method for extracting a single column:
public static JavaRDD<String> getColumn(JavaRDD<List<String>> data, final int index) {
    return data.flatMap(new FlatMapFunction<List<String>, String>() {
        public Iterable<String> call(List<String> s) {
            return Arrays.asList(s.get(index));
        }
    });
}
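Conceptually, getColumn just pulls the field at position index out of every row. Stripped of the Spark API, the logic is this (the class and variable names here are only for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ColumnDemo {
    // Plain-Java analogue of getColumn: pick field `index` from every row
    static List<String> getColumn(List<List<String>> rows, int index) {
        List<String> column = new ArrayList<>();
        for (List<String> row : rows) {
            column.add(row.get(index));
        }
        return column;
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("one", "two", "three"),
                Arrays.asList("four", "five", "six"));
        System.out.println(getColumn(rows, 1)); // prints [two, five]
    }
}
```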
But later I want to apply many transformations to columns, change the positions of columns, etc. So it would be easier to have an RDD containing the COLUMNS as ArrayLists, not the LINES. Does anyone have an idea how to achieve this? I don't want to call getColumn() n times. It would be great if you could help me.
Explanation: My csvData looks like this:
one, two, three
four, five, six
seven, eight, nine
My lines RDD looks like this:
[one, two, three]
[four, five, six]
[seven, eight, nine]
But I want this:
[one, four, seven]
[two, five, eight]
[three, six, nine]
Comment: So if your file is "1, one, uno \ 2, two, dos \ 3, three, tres" and your current RDD is ["1", "one", "uno"] \ ["2", "two", "dos"] \ ["3", "three", "tres"], you would like an RDD<String[]> of ["1", "2", "3"] \ ["one", "two", "three"] \ ["uno", "dos", "tres"], i.e. basically transposing the RDD?
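For reference, the transposition itself is simple when every row has the same length. Here is a plain-Java sketch of the logic (not distributed: in Spark you would typically emit (columnIndex, value) pairs from each row and group them with groupByKey instead of collecting everything to the driver):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TransposeDemo {
    // Transpose rows -> columns; assumes every row has the same length
    static List<List<String>> transpose(List<List<String>> rows) {
        List<List<String>> columns = new ArrayList<>();
        int width = rows.get(0).size();
        for (int col = 0; col < width; col++) {
            List<String> column = new ArrayList<>();
            for (List<String> row : rows) {
                column.add(row.get(col));
            }
            columns.add(column);
        }
        return columns;
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("one", "two", "three"),
                Arrays.asList("four", "five", "six"),
                Arrays.asList("seven", "eight", "nine"));
        System.out.println(transpose(rows));
        // prints [[one, four, seven], [two, five, eight], [three, six, nine]]
    }
}
```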