I'm trying to read a CSV file with Java and Spark. Currently I do this:
String master = "local[2]";
String csvInput = "/home/username/Downloads/countrylist.csv";
String csvOutput = "/home/username/Downloads/countrylist";

JavaSparkContext sc = new JavaSparkContext(master, "loadwholecsv",
        System.getenv("SPARK_HOME"), System.getenv("JARS"));
JavaRDD<String> csvData = sc.textFile(csvInput, 1);
JavaRDD<List<String>> lines = csvData.map(new Function<String, List<String>>() {
    @Override
    public List<String> call(String s) {
        return new ArrayList<String>(Arrays.asList(s.split("\\s*,\\s*")));
    }
});
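As a side note, the `\s*,\s*` pattern splits on commas and strips the surrounding whitespace in one step, so no extra trim() is needed. A quick plain-Java check of that regex (outside Spark):

```java
import java.util.Arrays;
import java.util.List;

public class SplitDemo {
    public static void main(String[] args) {
        // Same regex as in the map() above: split on commas, trimming spaces
        List<String> fields = Arrays.asList("one, two,   three".split("\\s*,\\s*"));
        System.out.println(fields); // prints [one, two, three]
    }
}
```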
So I have all the lines of the CSV file as rows in my RDD. I also wrote this method for extracting a single column:
public static JavaRDD<String> getColumn(JavaRDD<List<String>> data, final int index) {
    return data.flatMap(new FlatMapFunction<List<String>, String>() {
        public Iterable<String> call(List<String> s) {
            return Arrays.asList(s.get(index));
        }
    });
}
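Conceptually, getColumn just pulls the field at position index out of every row. Stripped of the Spark API, the logic is this (the class and variable names here are only for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ColumnDemo {
    // Plain-Java analogue of getColumn: pick field `index` from every row
    static List<String> getColumn(List<List<String>> rows, int index) {
        List<String> column = new ArrayList<>();
        for (List<String> row : rows) {
            column.add(row.get(index));
        }
        return column;
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("one", "two", "three"),
                Arrays.asList("four", "five", "six"));
        System.out.println(getColumn(rows, 1)); // prints [two, five]
    }
}
```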
But later I want to apply many transformations to columns, change the positions of columns, etc. So it would be easier to have an RDD containing the COLUMNS as ArrayLists, not the LINES. Does anyone have an idea how to achieve this? I don't want to call getColumn() n times. It would be great if you could help me.
Explanation: My csvData looks like this:
one, two, three
four, five, six
seven, eight, nine
My lines RDD looks like this:
[one, two, three]
[four, five, six]
[seven, eight, nine]
But I want this:
[one, four, seven]
[two, five, eight]
[three, six, nine]
Comment: So if your file is "1, one, uno \ 2, two, dos \ 3, three, tres" and your current RDD is ["1", "one", "uno"] \ ["2", "two", "dos"] \ ["3", "three", "tres"], you would like an RDD<String[]> of ["1", "2", "3"] \ ["one", "two", "three"] \ ["uno", "dos", "tres"], i.e. basically transposing the RDD?
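For reference, the transposition itself is simple when every row has the same length. Here is a plain-Java sketch of the logic (not distributed: in Spark you would typically emit (columnIndex, value) pairs from each row and group them with groupByKey instead of collecting everything to the driver):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TransposeDemo {
    // Transpose rows -> columns; assumes every row has the same length
    static List<List<String>> transpose(List<List<String>> rows) {
        List<List<String>> columns = new ArrayList<>();
        int width = rows.get(0).size();
        for (int col = 0; col < width; col++) {
            List<String> column = new ArrayList<>();
            for (List<String> row : rows) {
                column.add(row.get(col));
            }
            columns.add(column);
        }
        return columns;
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("one", "two", "three"),
                Arrays.asList("four", "five", "six"),
                Arrays.asList("seven", "eight", "nine"));
        System.out.println(transpose(rows));
        // prints [[one, four, seven], [two, five, eight], [three, six, nine]]
    }
}
```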