I need to loop through all the rows of a Spark DataFrame and use the values in each row as inputs for a function.
Basically, I want this to happen:
- Get a row of the DataFrame
- Separate the values in that row into different variables
- Use those variables as inputs for a function I defined
The thing is, I can't use collect() because the DataFrame is too big to fit in the driver's memory.
I am pretty sure I have to use map() to do what I want, and I have tried this:
MyDF.rdd.map(MyFunction)
But how can I specify which values I want to retrieve from each Row? Something like Row(0), Row(1), and Row(2)?
And how do I "feed" those values to my function?
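To make it concrete, here is a minimal self-contained version of what I'm picturing. MyFunction, the sample data, and the column names colA/colB/colC are just placeholders for my real ones, and I'm not sure the lambda is the right way to pull each Row apart:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder for my real function: it takes three values from a row.
def MyFunction(a, b, c):
    return a + b + c  # stand-in logic

# Tiny stand-in for my real (much bigger) DataFrame.
MyDF = spark.createDataFrame(
    [(1, 2, 3), (4, 5, 6)],
    ["colA", "colB", "colC"],
)

# My guess: unpack each Row's fields by column name and pass them along.
result = MyDF.rdd.map(
    lambda row: MyFunction(row["colA"], row["colB"], row["colC"])
)

# collect() here only because this example is tiny; my real data can't be collected.
print(result.collect())
```

Is this the right approach, or is there a better way to apply a multi-argument function to each row without bringing the data to the driver?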