I am trying to read a CSV file with Java and Spark.

Currently I do this:

    String master = "local[2]";
    String csvInput = "/home/username/Downloads/countrylist.csv";
    String csvOutput = "/home/username/Downloads/countrylist";

    JavaSparkContext sc = new JavaSparkContext(master, "loadwholecsv", System.getenv("SPARK_HOME"), System.getenv("JARS"));

    JavaRDD<String> csvData = sc.textFile(csvInput, 1);
    JavaRDD<List<String>> lines = csvData.map(new Function <String, List<String>>() {
        @Override
        public List<String> call(String s) {
            return new ArrayList<String>(Arrays.asList(s.split("\\s*,\\s*")));
        }
    });

So each line of the CSV file is one element (a list of fields) in my RDD. I also wrote this method to extract a column:

    public static JavaRDD<String> getColumn(JavaRDD<List<String>> data, final int index) {
        return data.flatMap(new FlatMapFunction<List<String>, String>() {
            @Override
            public Iterable<String> call(List<String> s) {
                return Arrays.asList(s.get(index));
            }
        });
    }

But later I want to apply many transformations to columns, change the position of columns, and so on. So it would be easier to have an RDD filled with the COLUMNS as ArrayLists, not the LINES.

Does anyone have an idea how to achieve this? I don't want to call "getColumn()" n times.

It would be great if you could help me.

Explanation: My csvData looks like this:

one, two, three
four, five, six
seven, eight, nine

My lines RDD looks like this:

[one, two, three]
[four, five, six]
[seven, eight, nine]

But I want this:

[one, four, seven]
[two, five, eight]
[three, six, nine]
  • So what is the type of the expected RDD? RDD<String[]> ? Commented Nov 8, 2014 at 20:35
  • It's RDD<List<String>> Commented Nov 8, 2014 at 20:37
  • So it's what you have already. What needs to change? Commented Nov 8, 2014 at 20:42
  • As I said, I want to have COLUMNS, not LINES, in my RDD<List<String>> Commented Nov 8, 2014 at 20:44
  • So given your original data is "1, one, uno \ 2, two, dos \ 3, three, tres", and your current RDD is ["1", "one", "uno" \ "2", "two", "dos" \ "3", "three", "tres"], you would like an RDD: ["1","2","3" \ "one", "two", "three" \ "uno", "dos", "tres"], basically transposing the RDD? Commented Nov 8, 2014 at 20:51

2 Answers

To do a map-reduce-based matrix transposition, which is basically what is being asked, you would proceed as follows:

  1. Transform your lines into indexed tuples: (hint: use zipWithIndex and map)

    [(1,1,one), (1,2,two), (1,3,three)]
    [(2,1,four), (2,2,five), (2,3,six)]
    [(3,1,seven), (3,2,eight), (3,3,nine)]
    
  2. Add the column as key to each tuple: (hint: use map)

    [(1,(1,1,one)), (2,(1,2,two)), (3,(1,3,three))]
    [(1,(2,1,four)), (2,(2,2,five)), (3,(2,3,six))]
    [(1,(3,1,seven)), (2,(3,2,eight)), (3,(3,3,nine))]
    
  3. Group by key

    [(1,[(3,1,seven), (1,1,one), (2,1,four)])]
    [(2,[(1,2,two), (3,2,eight), (2,2,five)])]
    [(3,[(2,3,six), (1,3,three), (3,3,nine)])]
    
  4. Sort values back in order and remove the indexing artifacts (hint: map)

    [ one, four, seven ]
    [ two, five, eight ]
    [ three, six, nine ]
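
The four steps above can be sketched in plain Java, without Spark, to make the data flow concrete. This is only an illustration under stated assumptions: the class `TransposeSketch`, the helper type `Cell`, and the method `transpose` are hypothetical names, indices are 0-based rather than 1-based, and the whole data set is held in memory. In Spark, the corresponding operations would presumably be `zipWithIndex`, `flatMapToPair`, `groupByKey`, and a final `map`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class TransposeSketch {

    /** Hypothetical helper: one cell of the matrix as (row, column, value). */
    static final class Cell {
        final int row;
        final int col;
        final String value;

        Cell(int row, int col, String value) {
            this.row = row;
            this.col = col;
            this.value = value;
        }
    }

    static List<List<String>> transpose(List<List<String>> lines) {
        // Steps 1 and 2: tag every value with its row and column index.
        List<Cell> cells = new ArrayList<>();
        for (int r = 0; r < lines.size(); r++) {
            List<String> line = lines.get(r);
            for (int c = 0; c < line.size(); c++) {
                cells.add(new Cell(r, c, line.get(c)));
            }
        }

        // Step 3: group by column index (the TreeMap keeps columns in order).
        Map<Integer, List<Cell>> byColumn = cells.stream()
                .collect(Collectors.groupingBy(cell -> cell.col, TreeMap::new, Collectors.toList()));

        // Step 4: sort each group by row index, then drop the indices.
        List<List<String>> columns = new ArrayList<>();
        for (List<Cell> group : byColumn.values()) {
            columns.add(group.stream()
                    .sorted(Comparator.comparingInt((Cell cell) -> cell.row))
                    .map(cell -> cell.value)
                    .collect(Collectors.toList()));
        }
        return columns;
    }

    public static void main(String[] args) {
        List<List<String>> lines = Arrays.asList(
                Arrays.asList("one", "two", "three"),
                Arrays.asList("four", "five", "six"),
                Arrays.asList("seven", "eight", "nine"));
        // prints [[one, four, seven], [two, five, eight], [three, six, nine]]
        System.out.println(transpose(lines));
    }
}
```

The explicit sort in step 4 matters in a distributed setting, because a group-by-key operation gives no guarantee about the order of values within a group.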
    
5 Comments

How can I do this on the lists inside the RDD? By using zipWithIndex I would get a tuple like (LIST, INDEX). I am a bit confused about how to get ([(element,index),(element,index)])
@progNewFag how would you transform List(A, B, C) into List((A,1), (B,2),(C,3))? Hint: plain Java, not Spark
I did the first two steps: pastie.org/private/u8zjkgkf6qakw96uk7dq . But I don't know how to do the grouping across the lists. Can you help me?
I am a bit confused, because I don't have a JavaPairRDD. Instead I have just a JavaRDD. How do I group there?
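
The plain-Java hint in the comments above (turning List(A, B, C) into List((A,1), (B,2), (C,3))) could look roughly like the following. This is a minimal sketch: the class `IndexElements` and the method `withIndex` are hypothetical names, and `Map.Entry` stands in for a pair type; in Spark code, `scala.Tuple2` is the more common choice.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class IndexElements {

    /** Pairs each element with its 1-based position in the list. */
    static <T> List<Map.Entry<T, Integer>> withIndex(List<T> items) {
        List<Map.Entry<T, Integer>> indexed = new ArrayList<>();
        for (int i = 0; i < items.size(); i++) {
            indexed.add(new SimpleEntry<T, Integer>(items.get(i), i + 1));
        }
        return indexed;
    }

    public static void main(String[] args) {
        // prints [A=1, B=2, C=3]
        System.out.println(withIndex(Arrays.asList("A", "B", "C")));
    }
}
```

A loop like this would run inside the function passed to a Spark transformation, once per line. As for the grouping question, in Spark's Java API `mapToPair` or `flatMapToPair` on a `JavaRDD` should yield a `JavaPairRDD`, which is the type that offers `groupByKey`.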
0
// SparkSession requires Spark 2.x or later, where CSV support is built in.
SparkSession spark = SparkSession.builder()
        .appName("csvReader")
        .master("local[2]")
        .getOrCreate();

String path = "C://Users//U6048715//Desktop//om.csv";

// Without a header option, columns get the default names _c0, _c1, ...
Dataset<org.apache.spark.sql.Row> df = spark.read().csv(path);
df.show();

1 Comment

Output:

    +-------------------+---+----+
    |                _c0|_c1| _c2|
    +-------------------+---+----+
    |CCC:000455763800001|  1|  F1|
    | WOS:00045576380002|  2|  F2|
    | WOS:00045576380003|  3|null|
    | WOS:00045576380004|  0|  F4|
    | WOS:00045576380005|  9|  F5|
    | WOS:00045576380006|  2|  s1|
    | WOS:00045576380007|  4|  s2|
    | WOS:00045576380008|  4|  s4|
    | WOS:00045576380009|  1|  s5|
    +-------------------+---+----+
