
I need to loop through all the rows of a Spark dataframe and use the values in each row as inputs for a function.

Basically, I want this to happen:

  1. Get a row of the dataframe
  2. Separate the values in that row into different variables
  3. Use those variables as inputs for a function I defined (see the sketch below)
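
Roughly, something like this sketch is the idea (the column names here are just placeholders for my real ones):

MyDF.collect().foreach { row =>
  val a = row.getAs[Int]("col1")    // first value of the row
  val b = row.getAs[String]("col2") // second value of the row
  MyFunction(a, b)                  // the function I defined
}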

The thing is, I can't use collect() because the dataframe is too big.

I am pretty sure I have to use map() to do what I want, and I have tried this:

MyDF.rdd.map(MyFunction)

But how can I specify the information I want to retrieve from the DataFrame? Something like Row(0), Row(1) and Row(2)?

And how do I "feed" those values to my function?

1 Answer


"Looping" is not what you really want, but a "projection". If for example your dataframe has 2 fields of type int and string, your code would look like this:

val myFunction = (i: Int, s: String) => ??? // do something with the variables

df.rdd.map(row => myFunction(row.getAs[Int]("field1"), row.getAs[String]("field2")))

or with pattern matching (this requires import org.apache.spark.sql.Row):

df.rdd.map { case Row(field1: Int, field2: String) => myFunction(field1, field2) }
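
To make this concrete, here is a minimal, self-contained sketch of the RDD approach (the sample data, the field names and the body of myFunction are made up for illustration):

import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// a tiny example dataframe with an Int field and a String field
val df = Seq((1, "a"), (2, "b")).toDF("field1", "field2")

val myFunction = (i: Int, s: String) => s * i // example body: repeat the string i times

val results = df.rdd.map { case Row(field1: Int, field2: String) => myFunction(field1, field2) }
results.take(10).foreach(println) // only a small sample is brought back to the driver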

Note that in Spark 2 you can use map directly on your dataframe and get a new Dataset back (in Spark 1.6, map would give you an RDD instead).
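
A sketch of that Spark 2 variant, assuming the same field names and that myFunction returns a type with an encoder (e.g. a String):

import spark.implicits._ // provides the encoder for the result type

val resultDs = df.map(row => myFunction(row.getAs[Int]("field1"), row.getAs[String]("field2")))
// resultDs is a Dataset of the function's results, not an RDD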

Note that instead of using map on the RDD, you could also use a "User Defined Function" (UDF) in the dataframe API.
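
For completeness, a UDF version could look roughly like this (again with the same assumed field names and a myFunction that returns a String; the results end up in a new column):

import org.apache.spark.sql.functions.{col, udf}

val myUdf = udf(myFunction) // wrap the (Int, String) => String function
val withResult = df.withColumn("result", myUdf(col("field1"), col("field2")))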
