
I have been assigned the task of reading from a CSV file and creating a ListMap variable. The reason for using this specific class is that, for some other use cases, a number of existing methods already take a ListMap as an input parameter, and they want one more.

What I have done so far is read from the CSV and create an RDD. The format of the CSV is

"field1,field2"
"value1,value2"
"value3,value4"

In this RDD I have tuples of strings. What I would like now is to convert this to a ListMap. So what I have is a variable of type Array[(String, String)] holding ((value1,value2), (value3,value4)).

I did this because I find it easy to go from a CSV to tuples. The problem is that I cannot find any way to go from here to a ListMap. It seems easier to get a normal Map, but, as I said, the final result is required to be a ListMap.
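For concreteness, here is a rough sketch of what I have so far (the spark session name and the file path are placeholders, and I am assuming the first line is a header that needs dropping):

val raw = spark.sparkContext.textFile("/path/to/file.csv")
val header = raw.first()              // the "field1,field2" line
val tuples = raw
  .filter(_ != header)                // drop the header row
  .map(_.split(","))
  .map(a => (a(0), a(1)))             // RDD[(String, String)]
  .collect()                          // Array[(String, String)]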

I have been reading, but I do not really understand this answer or this one.

  • is your csv file space delimited? Commented Nov 10, 2017 at 11:41
  • it is separated by ",", by which I mean that in each line the fields are separated by the "," character, and each line ends in a "\n" Commented Nov 10, 2017 at 11:42
  • is my edit of the input csv data in your question correct? If yes, are there " (inverted commas) in your data? Commented Nov 10, 2017 at 11:45
  • can you share your code too, if possible? Commented Nov 10, 2017 at 11:54
  • Yes, there are inverted commas. Sometimes the strings are plain and sometimes they have inverted commas. I am trying to show some code, but since I change it so often to try different cases and I'm working with teammates, I'm afraid I do not have anything worth showing right now :S Commented Nov 10, 2017 at 11:57

2 Answers


Based on the sample data you provided, you can use the collectAsMap API as a first step towards the final ListMap:

val collected = sparkSession.sparkContext.textFile("path to the text file")
  .map(line => line.split(","))        // note: the header row is not skipped here
  .map(array => array(0) -> array(1))
  .collectAsMap()                      // returns an unordered scala.collection.Map

That's it.

Now, if you want to go a step further, you can add one more step:

  import scala.collection.immutable.ListMap

  var listMap: ListMap[String, String] = ListMap.empty[String, String]
  for (pair <- collected) {
    listMap += pair
  }
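Note that collectAsMap returns an unordered scala.collection.Map, so the loop above won't necessarily preserve the file's row order, which is usually the whole point of a ListMap. If order matters, one sketch of an alternative (assuming a single input file, so that collect keeps the line order) is:

import scala.collection.immutable.ListMap

// Collect the pairs as an ordered array first, then build the ListMap
// directly, so the original row order of the file is preserved.
val pairs = sparkSession.sparkContext.textFile("path to the text file")
  .map(line => line.split(","))
  .map(array => array(0) -> array(1))
  .collect()                           // Array[(String, String)], in file order

val orderedListMap = ListMap(pairs: _*)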

1 Comment

I ended up using both answers, and they were both incredibly helpful. Any advice on how to reflect that? I guess there is a meta post talking about this issue; there always is haha

Array("foo" -> "bar", "baz" -> "bat").toMap gives you a Map. If you are looking for a ListMap specifically (for the life of me, can't think of a reason why you would), then you need a breakOut:

import scala.collection.immutable.ListMap

val map: ListMap[String, String] =
  Array("foo" -> "bar", "baz" -> "bat")
    .toMap                              // caveat: may lose the original order once there are more than four entries
    .map(identity)(scala.collection.breakOut)

breakOut is sort of a "collection factory" that lets you implicitly convert between different collection types. You can read more about it here: https://docs.scala-lang.org/tutorials/FAQ/breakout.html
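Note that scala.collection.breakOut was removed in Scala 2.13. On 2.13 and later, a rough equivalent is to build the ListMap straight from the array of pairs with .to, skipping the intermediate Map entirely:

import scala.collection.immutable.ListMap

// Scala 2.13+: convert the array of pairs directly, preserving their order.
val map: ListMap[String, String] =
  Array("foo" -> "bar", "baz" -> "bat").to(ListMap)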

6 Comments

Ooooh, so that is what breakOut is. Btw, I do not know either; I hope to have time to have it explained to me one of these days... More importantly, I wanted to say that I sadly wrote the question wrong. When I read a csv, what I get is RDD[(String,String)]. My idea at this point is to collect said RDD to get an array, and then be able to do what you have taught me. But I have this feeling that aaaaaaaaall of this is doing it wrong (on my part).
It's not "wrong" (except, maybe, for the part about using spark for this in the first place), as long as it fits into memory. And if you want to end up with a single Map in the end, it had better do :)
So Spark is the wrong tool for this, right? I was just talking about this the other day. Hive or Impala would make more sense, I assume, since it's a file we could store straight in hdfs. What other choices would be recommended for a problem similar to this? And yeah, thanks for the warning about the memory and the map issue. I did look into it previously; for now it seems to be a small amount of data, because I was very worried about doing collect()
You don't need hive (or even hdfs) either. Just drop a file on disk, and read it directly.
Source.fromFile("filename.csv").getLines
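Putting that together, a plain-Scala sketch (assuming the first line is the header and the values contain no quoted commas) could look like:

import scala.collection.immutable.ListMap
import scala.io.Source

// Read the file, skip the header, split each line on the first comma,
// and build an order-preserving ListMap from the resulting pairs.
val source = Source.fromFile("filename.csv")
val listMap: ListMap[String, String] =
  try {
    ListMap(
      source.getLines()
        .drop(1)                       // skip the "field1,field2" header
        .map(_.split(",", 2))
        .map(a => a(0) -> a(1))
        .toSeq: _*
    )
  } finally source.close()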