How can I loop through a Spark DataFrame? I have a DataFrame that consists of:
time, id, direction
10, 4, True //here 4 enters --> (4,)
20, 5, True //here 5 enters --> (4,5)
34, 5, False //here 5 leaves --> (4,)
67, 6, True //here 6 enters --> (4,6)
78, 6, False //here 6 leaves --> (4,)
99, 4, False //here 4 leaves --> ()
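
For reproducibility, something like this should build the sample data (a sketch; the local SparkSession setup is an assumption, not part of my actual job):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("valid-ids-example")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// time: Long, id: Int, direction: Boolean
val df = Seq(
  (10L, 4, true),
  (20L, 5, true),
  (34L, 5, false),
  (67L, 6, true),
  (78L, 6, false),
  (99L, 4, false)
).toDF("time", "id", "direction")
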
It is sorted by time, and I would like to step through it and accumulate the valid ids: an id enters when direction == True and exits when direction == False.
So the resulting RDD should look like this:
time, valid_ids
(10, (4,))
(20, (4,5))
(34, (4,))
(67, (4,6))
(78, (4,))
(99, ())
I know this will not parallelize, but the DataFrame is not that big. How could this be done in Spark/Scala?
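
For what it's worth, here is a minimal sketch of the sequential accumulation I mean, collecting to the driver and folding with scanLeft (it assumes the column types and the spark session from the setup snippet above):

// Collect the sorted rows to the driver; acceptable here since the df is small.
// The getters assume time: Long, id: Int, direction: Boolean.
val rows = df.orderBy("time")
  .collect()
  .map(r => (r.getLong(0), r.getInt(1), r.getBoolean(2)))

// Accumulate: add the id when direction is true, remove it when false.
// scanLeft emits one (time, set) pair per input row; drop the seed element.
val result = rows.scanLeft((0L, Set.empty[Int])) {
  case ((_, ids), (time, id, dir)) =>
    (time, if (dir) ids + id else ids - id)
}.drop(1)

// Back to an RDD, if that is the shape that is needed.
val validIds = spark.sparkContext.parallelize(result)

This is the kind of stepping I have in mind, but it just does the work on the driver; I am hoping there is a cleaner way to express it within Spark itself.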