
I have been assigned the task of reading from a CSV file and creating a ListMap variable. The reason for using this specific class is that, for some other use cases, a number of existing methods already take a ListMap as an input parameter, and they want one more.

What I have done so far is read from the CSV and create an RDD. The format of the CSV is

"field1,field2"
"value1,value2"
"value3,value4"

In this RDD I have tuples of strings. What I would like now is to convert this to a ListMap. So what I have is a variable of type Array[(String, String)] holding ((value1,value2), (value3,value4)).

I did this because I find it easy to go from a CSV to tuples. The problem is that I cannot find any way to go from here to a ListMap. It seems easier to get a normal Map, but, as I said, the final result is required to be a ListMap.
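For concreteness, here is a rough sketch of what I have so far (the spark session name and the file path are placeholders, and I am assuming the first line is a header that needs dropping):

val raw = spark.sparkContext.textFile("/path/to/file.csv")
val header = raw.first()              // the "field1,field2" line
val tuples = raw
  .filter(_ != header)                // drop the header row
  .map(_.split(","))
  .map(a => (a(0), a(1)))             // RDD[(String, String)]
  .collect()                          // Array[(String, String)]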

I have been reading, but I do not really understand this answer or this one.

  • is your csv file space delimited? Commented Nov 10, 2017 at 11:41
  • it is separated by ",", by which I mean that in each line the fields are separated by the "," character, and each line ends in a "\n" Commented Nov 10, 2017 at 11:42
  • is my edit of the input csv data in your question correct? If yes, are there " (inverted commas) in your data? Commented Nov 10, 2017 at 11:45
  • can you share your code too, if possible? Commented Nov 10, 2017 at 11:54
  • Yes, there are inverted commas. Sometimes the strings are plain and sometimes they have inverted commas. I am trying to show some code, but since I change it so often to try different cases and I'm working with teammates, I'm afraid I do not have anything worth showing right now :S Commented Nov 10, 2017 at 11:57

2 Answers


Based on the sample data you provided, you can use the collectAsMap API as a first step towards the final ListMap:

val collected = sparkSession.sparkContext.textFile("path to the text file")
  .map(line => line.split(","))        // note: the header row is not skipped here
  .map(array => array(0) -> array(1))
  .collectAsMap()                      // returns an unordered scala.collection.Map

That's it.

Now, if you want to go a step further, you can add one more step:

  import scala.collection.immutable.ListMap

  var listMap: ListMap[String, String] = ListMap.empty[String, String]
  for (pair <- collected) {
    listMap += pair
  }
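Note that collectAsMap returns an unordered scala.collection.Map, so the loop above won't necessarily preserve the file's row order, which is usually the whole point of a ListMap. If order matters, one sketch of an alternative (assuming a single input file, so that collect keeps the line order) is:

import scala.collection.immutable.ListMap

// Collect the pairs as an ordered array first, then build the ListMap
// directly, so the original row order of the file is preserved.
val pairs = sparkSession.sparkContext.textFile("path to the text file")
  .map(line => line.split(","))
  .map(array => array(0) -> array(1))
  .collect()                           // Array[(String, String)], in file order

val orderedListMap = ListMap(pairs: _*)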

1 Comment

I ended up using both answers, and they were both incredibly helpful. Any advice on how to reflect that? I guess there is a meta post talking about this issue; there always is haha

Array("foo" -> "bar", "baz" -> "bat").toMap gives you a Map. If you are looking for a ListMap specifically (for the life of me, can't think of a reason why you would), then you need a breakOut:

import scala.collection.immutable.ListMap

val map: ListMap[String, String] =
  Array("foo" -> "bar", "baz" -> "bat")
    .toMap                              // caveat: may lose the original order once there are more than four entries
    .map(identity)(scala.collection.breakOut)

breakOut is sort of a "collection factory" that lets you implicitly convert between different collection types. You can read more about it here: https://docs.scala-lang.org/tutorials/FAQ/breakout.html
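Note that scala.collection.breakOut was removed in Scala 2.13. On 2.13 and later, a rough equivalent is to build the ListMap straight from the array of pairs with .to, skipping the intermediate Map entirely:

import scala.collection.immutable.ListMap

// Scala 2.13+: convert the array of pairs directly, preserving their order.
val map: ListMap[String, String] =
  Array("foo" -> "bar", "baz" -> "bat").to(ListMap)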

6 Comments

Ooooh, so that is what breakOut is. Btw, I do not know either; I hope to have time to have it explained to me one of these days... More importantly, I wanted to say that I sadly wrote the question wrong. When I read a csv, what I get is RDD[(String,String)]. My idea at this point is to collect said RDD to get an array, and then be able to do what you have taught me. But I have this feeling that aaaaaaaaall of this is doing it wrong (on my part).
It's not "wrong" (except, maybe, for the part about using spark for this in the first place), as long as it fits into memory. And if you want to end up with a single Map in the end, it had better do :)
So Spark is the wrong tool for this, right? I was just talking about this the other day. Hive or Impala would make more sense, I assume, since it's a file we could store straight in hdfs. What other choices would be recommended for a problem similar to this? And yeah, thanks for the warning about the memory and the map issue. I did look into it previously; for now it seems to be a small amount of data, because I was very worried about doing collect()
You don't need hive (or even hdfs) either. Just drop a file on disk, and read it directly.
Source.fromFile("filename.csv").getLines
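Putting that together, a plain-Scala sketch (assuming the first line is the header and the values contain no quoted commas) could look like:

import scala.collection.immutable.ListMap
import scala.io.Source

// Read the file, skip the header, split each line on the first comma,
// and build an order-preserving ListMap from the resulting pairs.
val source = Source.fromFile("filename.csv")
val listMap: ListMap[String, String] =
  try {
    ListMap(
      source.getLines()
        .drop(1)                       // skip the "field1,field2" header
        .map(_.split(",", 2))
        .map(a => a(0) -> a(1))
        .toSeq: _*
    )
  } finally source.close()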