
I am trying to add elements to a mutable Scala list, as shown below. I read the values from a DataFrame row by row, extract the value of the "_title" field, and append it to the list. But when the for loop completes, the list is still empty. This is the code:

import scala.collection.mutable.ListBuffer

val flatK = dfR.withColumn("UserValue", explode(col("UserValue")))
var colListA = new ListBuffer[String]()
// var colSet : List[String] = List()
for (i <- 0 until dfR.count().toInt) {
  flatK.filter($"columnIndex" === i).foreach { r =>
    val columnName = r.getAs[Row]("UserValue").getAs[String]("_title")
    // println(columnName)
    colListA.append(columnName)
  }
}

println(columnName) actually prints the value I want to put inside my list. My dataframe dfR looks like this:

+--------------------------------------------------------------+-----------+
|UserValue                                                     |columnIndex|
+--------------------------------------------------------------+-----------+
|[, last_mod_date, 2009-01-14T13:40:53]                        |0          |
|[, object_string, SOLIDS]                                     |0          |
|[, last_mod_date, 2009-01-13T22:58:30]                        |1          |
|[, object_string, TORSO]                                      |1          |

When I do

colListA += "elements"
colListA += "adds"

I can see the elements being added. But that does not happen inside the foreach loop. Can anyone tell me what I should try? Basically, I expect colListA to be populated with last_mod_date and object_string.

  • This has been asked many times before, but I cannot find the excellent duplicate people always use. Anyway, the short answer: the foreach is not executed on the driver (where your buffer exists) but on the executors, each of which has its own local copy of the buffer. Each copy is modified, but the results are never synced back to the driver, so the main buffer stays empty. This common novice mistake comes from a poor understanding of Spark's architecture. I would recommend reading a little about how Spark works and what its use cases are. Commented Mar 11, 2019 at 22:11
  • Cool. But how can I achieve what I want here? If it's a duplicate, could you direct me to that link? Commented Mar 11, 2019 at 22:13
  • As I said, I could not find the exact duplicate with the super clear answer, but you can easily find a lot of similar questions. How you do it may depend on your real use case. You may simply call collect to bring all the values to your driver (which looks like what you want). Be warned that, if the DataFrame is big, you may just blow your memory. Spark is intended for working with large amounts of data that won't fit on one machine, but if you are only working in local mode for debugging, then collect is all you need. Commented Mar 11, 2019 at 22:20
  • Well, my DataFrame could actually be very large, so if I call collect it might run out of memory as well. Let me try to figure something else out. Commented Mar 11, 2019 at 22:22
  • You are trying to add each string from a List to a mutable.ListBuffer. Please correct me if I am wrong. Commented Mar 12, 2019 at 4:53
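
The driver/executor split described in the first comment can be sketched with a Spark accumulator, which is one supported way for code running on the executors to send values back to the driver. This is a minimal sketch, not the asker's code: the sample rows, the `_value` field name, and the `local[*]` session are assumptions made so the example is self-contained; only `UserValue`, `_title`, and `columnIndex` come from the question. The import of `scala.collection.JavaConverters` assumes Scala 2.12 (it is deprecated, but still present, in 2.13).

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, struct}
import scala.collection.JavaConverters._

// Assumption: a local session, only so the sketch is self-contained.
val spark = SparkSession.builder().master("local[*]").appName("titles-accumulator").getOrCreate()
import spark.implicits._

// Hypothetical sample data shaped like the exploded DataFrame in the question.
// `_value` is an invented field name; `_title` is the one from the question.
val flatK = Seq(
  ("last_mod_date", "2009-01-14T13:40:53", 0),
  ("object_string", "SOLIDS", 0),
  ("last_mod_date", "2009-01-13T22:58:30", 1),
  ("object_string", "TORSO", 1)
).toDF("_title", "_value", "columnIndex")
  .select(struct(col("_title"), col("_value")).as("UserValue"), col("columnIndex"))

// The accumulator is created on the driver; executors add to it and
// Spark merges the per-task results back into the driver's copy.
val titlesAcc = spark.sparkContext.collectionAccumulator[String]("titles")

flatK.foreach { (r: Row) =>
  titlesAcc.add(r.getAs[Row]("UserValue").getAs[String]("_title"))
}

// Arrival order is not guaranteed, so sort before using the result.
val titles: List[String] = titlesAcc.value.asScala.toList.sorted
```

Note that the accumulator still gathers every value on the driver, duplicates included, so it does not remove the memory concern raised above; it only makes the executor-to-driver transfer explicit.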

1 Answer


If you want to create a list from a column of a DataFrame, select the column and collect it: dataframe.select("_title").collect().map(_(0).asInstanceOf[String]).toList

This gives you the values of that column as a list of strings.
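
In the question, `_title` is a field inside the `UserValue` struct rather than a top-level column, so the select above would likely need a nested path. The following is a sketch under assumptions: the sample rows, the `_value` field name, and the local session are invented for illustration. Calling `distinct()` before `collect()` means only the unique titles travel to the driver, which may ease the memory concern raised in the comments when the number of distinct titles is small.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct}

// Assumption: a local session, only so the sketch is self-contained.
val spark = SparkSession.builder().master("local[*]").appName("distinct-titles").getOrCreate()
import spark.implicits._

// Hypothetical sample data shaped like the exploded DataFrame in the question.
// `_value` is an invented field name; `_title` is the one from the question.
val flatK = Seq(
  ("last_mod_date", "2009-01-14T13:40:53", 0),
  ("object_string", "SOLIDS", 0),
  ("last_mod_date", "2009-01-13T22:58:30", 1),
  ("object_string", "TORSO", 1)
).toDF("_title", "_value", "columnIndex")
  .select(struct(col("_title"), col("_value")).as("UserValue"), col("columnIndex"))

// Select the nested field, de-duplicate on the cluster, then bring
// only the distinct values to the driver.
val titles: List[String] = flatK
  .select(col("UserValue._title"))
  .distinct()
  .as[String]
  .collect()
  .toList
  .sorted // distinct() does not guarantee order, so sort for a stable result
```

With the sample rows above, `titles` holds the two names the question expects, `last_mod_date` and `object_string`.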
