How can I loop through a Spark DataFrame? I have a DataFrame that consists of:
time, id, direction
10, 4, True //here 4 enters --> (4,)
20, 5, True //here 5 enters --> (4,5)
34, 5, False //here 5 leaves --> (4,)
67, 6, True //here 6 enters --> (4,6)
78, 6, False //here 6 leaves --> (4,)
99, 4, False //here 4 leaves --> ()
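
For reproducibility, something like this should build the sample data (a sketch; the local SparkSession setup is an assumption, not part of my actual job):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("valid-ids-example")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// time: Long, id: Int, direction: Boolean
val df = Seq(
  (10L, 4, true),
  (20L, 5, true),
  (34L, 5, false),
  (67L, 6, true),
  (78L, 6, false),
  (99L, 4, false)
).toDF("time", "id", "direction")
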
It is sorted by time, and I would like to step through it and accumulate the valid ids: an id enters when direction == True and exits when direction == False.
So the resulting RDD should look like this:
time, valid_ids
(10, (4,))
(20, (4,5))
(34, (4,))
(67, (4,6))
(78, (4,))
(99, ())
I know this will not parallelize, but the DataFrame is not that big. How could this be done in Spark/Scala?
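
For what it's worth, here is a minimal sketch of the sequential accumulation I mean, collecting to the driver and folding with scanLeft (it assumes the column types and the spark session from the setup snippet above):

// Collect the sorted rows to the driver; acceptable here since the df is small.
// The getters assume time: Long, id: Int, direction: Boolean.
val rows = df.orderBy("time")
  .collect()
  .map(r => (r.getLong(0), r.getInt(1), r.getBoolean(2)))

// Accumulate: add the id when direction is true, remove it when false.
// scanLeft emits one (time, set) pair per input row; drop the seed element.
val result = rows.scanLeft((0L, Set.empty[Int])) {
  case ((_, ids), (time, id, dir)) =>
    (time, if (dir) ids + id else ids - id)
}.drop(1)

// Back to an RDD, if that is the shape that is needed.
val validIds = spark.sparkContext.parallelize(result)

This is the kind of stepping I have in mind, but it just does the work on the driver; I am hoping there is a cleaner way to express it within Spark itself.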