
I have a problem with a MapType column in Spark using the Scala API. For each session we send a map containing the categories the user visited, associated with the number of events in each category:

[ home & personal items > interior -> 1, vehicles > cars -> 1] 

Not all users visit the same number of categories, so the size of the map changes from one user_id to another.

I need to calculate the number of sessions grouped by category. To do that, I thought I needed to loop over the map while it is not empty. Here is what I've tried so far:

while (size(col("categoriesRaw")) !== 0) {
    df.select(
        explode(col("categoriesRaw"))
    )
    .select(
        col("key").alias("categ"),
        col("value").alias("number_of_events")
    )
}

but I'm facing errors like:

type mismatch;
 found   : org.apache.spark.sql.Column
 required: Boolean
  • Can you share a sample dataframe? Commented May 1, 2019 at 17:28
  • @Kaushal Something like: StructField("sessionId", StringType, true), StructField("categoriesRaw", MapType(StringType, IntegerType, true), true). Commented May 1, 2019 at 23:18
  • Can you share a sample data field? Commented May 2, 2019 at 7:35
  • @YayatiSule Something like: [ home & personal items > interior -> 1, vehicles > cars -> 1 ], [ vehicles > cars -> 3 ]. Commented May 2, 2019 at 9:26
  • Is your raw data in JSON format, like this: [{"home_and_personal":{"interior":1},"vehicles":{"cars":1}}]? Commented May 2, 2019 at 9:30

1 Answer


I'm not sure what you are trying to do with the while loop. In any case, you can check in the REPL that the expression you use as a condition is a Column, not a Boolean, hence the compile error.

> size(col("categoriesRaw")) !== 0
res1: org.apache.spark.sql.Column = (NOT (size(categoriesRaw) = 0))

Basically, this is an expression that needs to be evaluated by SparkSQL within a where, select or any other function that uses Columns.
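
For instance, here is a minimal sketch of how such an expression would be used, assuming df is the DataFrame from your question:

import org.apache.spark.sql.functions.{col, size}

// Sketch: the Column expression goes inside where/filter so that Spark
// evaluates it row by row (=!= is the Column inequality operator).
val nonEmptyCategories = df.where(size(col("categoriesRaw")) =!= 0)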

Nevertheless, your Spark code is almost there; you just need to add a groupBy to get where you want. Let's start by recreating your data.

import spark.implicits._
import org.apache.spark.sql.functions._
val users = Seq( "user 1" -> Map("home & personal items > interior" -> 1,
                                 "vehicles > cars" -> 1), 
                 "user 2" -> Map("vehicles > cars" -> 3)) 
val df = users.toDF("user", "categoriesRaw")

Then, you don't need a while loop to iterate over all the entries of the map: explode does exactly that for you, producing one row per key/value pair:

val explodedDf = df.select( explode('categoriesRaw) )
explodedDf.show(false)

+--------------------------------+-----+
|key                             |value|
+--------------------------------+-----+
|home & personal items > interior|1    |        
|vehicles > cars                 |1    |
|vehicles > cars                 |3    |
+--------------------------------+-----+ 

Finally, you can use groupBy and aggregate to get what you want.

explodedDf
    .select('key as "categ", 'value as "number_of_events")
    .groupBy("categ")
    .agg(count('*), sum('number_of_events))
    .show(false)

+--------------------------------+--------+---------------------+
|categ                           |count(1)|sum(number_of_events)|
+--------------------------------+--------+---------------------+
|home & personal items > interior|1       |1                    |
|vehicles > cars                 |2       |4                    |
+--------------------------------+--------+---------------------+

NB: I was not sure if you wanted to count the sessions (1st column) or the events (2nd column) so I computed both.
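
If it is specifically the number of distinct sessions per category that you need, here is a sketch that keeps the session id through the explode (assuming a session/user id column as in your schema, and spark.implicits._ already imported above):

import org.apache.spark.sql.functions.{explode, countDistinct}

// Sketch: carry the session/user id along with each exploded map entry,
// then count distinct sessions per category.
df.select('user as "sessionId", explode('categoriesRaw))
  .groupBy('key as "categ")
  .agg(countDistinct('sessionId) as "number_of_sessions")
  .show(false)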


Comments

Okay, let me explain myself. I'm trying to do the while loop because when I use explode and then select key and value, it only seems to take the first (key, value) pair, not all of them... maybe because the size of each categoriesRaw map changes from one user to another.
It would greatly help if you could provide sample data and the expected output. This way we would understand better what you are trying to do.
Input: StructField("sessionId", StringType, true), StructField("categoriesRaw", MapType(StringType, IntegerType, true), true). The map is like: user 1: [ home & personal items > interior -> 1, vehicles > cars -> 1 ], user 2: [ vehicles > cars -> 3 ]
I'd like a loop that iterates over a map as long as it is not empty (because we don't know the exact size of each map) and explodes ALL the key/value pairs of that map (because without a loop it only takes the first key/value pair).
Can you add all this information to your question? It will help others with a similar problem find a solution more easily, and people who want to help you will have all the info in one place.