
I need to loop through all the rows of a Spark dataframe and use the values in each row as inputs for a function.

Basically, I want this to happen:

  1. Get a row of the dataframe
  2. Separate the values in that row into different variables
  3. Use those variables as inputs for a function I defined (see the sketch below)
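
Roughly, something like this sketch is the idea (the column names here are just placeholders for my real ones):

MyDF.collect().foreach { row =>
  val a = row.getAs[Int]("col1")    // first value of the row
  val b = row.getAs[String]("col2") // second value of the row
  MyFunction(a, b)                  // the function I defined
}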

The thing is, I can't use collect() because the dataframe is too big.

I am pretty sure I have to use map() to do what I want, and I have tried this:

MyDF.rdd.map(MyFunction)

But how can I specify the information I want to retrieve from the DataFrame? Something like Row(0), Row(1) and Row(2)?

And how do I "feed" those values to my function?

1 Answer


"Looping" is not what you really want, but a "projection". If for example your dataframe has 2 fields of type int and string, your code would look like this:

val myFunction = (i: Int, s: String) => ??? // do something with the variables

df.rdd.map(row => myFunction(row.getAs[Int]("field1"), row.getAs[String]("field2")))

or with pattern matching (this requires import org.apache.spark.sql.Row):

df.rdd.map { case Row(field1: Int, field2: String) => myFunction(field1, field2) }
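
To make this concrete, here is a minimal, self-contained sketch of the RDD approach (the sample data, the field names and the body of myFunction are made up for illustration):

import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// a tiny example dataframe with an Int field and a String field
val df = Seq((1, "a"), (2, "b")).toDF("field1", "field2")

val myFunction = (i: Int, s: String) => s * i // example body: repeat the string i times

val results = df.rdd.map { case Row(field1: Int, field2: String) => myFunction(field1, field2) }
results.take(10).foreach(println) // only a small sample is brought back to the driver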

Note that in Spark 2 you can use map directly on your dataframe and get a new Dataset back (in Spark 1.6, map would give you an RDD instead).
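
A sketch of that Spark 2 variant, assuming the same field names and that myFunction returns a type with an encoder (e.g. a String):

import spark.implicits._ // provides the encoder for the result type

val resultDs = df.map(row => myFunction(row.getAs[Int]("field1"), row.getAs[String]("field2")))
// resultDs is a Dataset of the function's results, not an RDD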

Note that instead of using map on the RDD, you could also use a "User Defined Function" (UDF) in the dataframe API.
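
For completeness, a UDF version could look roughly like this (again with the same assumed field names and a myFunction that returns a String; the results end up in a new column):

import org.apache.spark.sql.functions.{col, udf}

val myUdf = udf(myFunction) // wrap the (Int, String) => String function
val withResult = df.withColumn("result", myUdf(col("field1"), col("field2")))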
