
I have a problem with a MapType column in Spark using the Scala API. For each session we send a map containing the categories the user visited, associated with the number of events in each category:

[ home & personal items > interior -> 1, vehicles > cars -> 1] 

Not all users visit the same number of categories, so the size of the map changes from one user_id to another.

I need to calculate the number of sessions grouped by category. To do that, I thought I needed to loop over the map while it is not empty. Here is what I've tried so far:

while (size(col("categoriesRaw")) !== 0) {
    df.select(
        explode(col("categoriesRaw"))
    )
    .select(
        col("key").alias("categ"),
        col("value").alias("number_of_events")
    )
}

but I'm facing errors like:

type mismatch;
 found   : org.apache.spark.sql.Column
 required: Boolean
  • Can you share a sample dataframe? Commented May 1, 2019 at 17:28
  • @Kaushal Something like: StructField("sessionId", StringType, true), StructField("categoriesRaw", MapType(StringType, IntegerType, true), true). Commented May 1, 2019 at 23:18
  • Can you share a sample data field? Commented May 2, 2019 at 7:35
  • @YayatiSule Something like: [ home & personal items > interior -> 1, vehicles > cars -> 1 ], [ vehicles > cars -> 3 ]. Commented May 2, 2019 at 9:26
  • Is your raw data in JSON format, like this: [{"home_and_personal":{"interior":1},"vehicles":{"cars":1}}]? Commented May 2, 2019 at 9:30

1 Answer


I'm not sure what you are trying to do with the while loop. In any case, you can check in the REPL that the expression you use as a condition is a Column, not a Boolean, hence the compile error.

> size(col("categoriesRaw")) !== 0
res1: org.apache.spark.sql.Column = (NOT (size(categoriesRaw) = 0))

Basically, this is an expression that needs to be evaluated by SparkSQL within a where, select or any other function that uses Columns.
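
For instance, here is a minimal sketch of how such an expression would be used, assuming df is the DataFrame from your question:

import org.apache.spark.sql.functions.{col, size}

// Sketch: the Column expression goes inside where/filter so that Spark
// evaluates it row by row (=!= is the Column inequality operator).
val nonEmptyCategories = df.where(size(col("categoriesRaw")) =!= 0)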

Nevertheless, your Spark code is almost there; you just need to add a groupBy to get where you want. Let's start by recreating your data.

import spark.implicits._
import org.apache.spark.sql.functions._
val users = Seq( "user 1" -> Map("home & personal items > interior" -> 1,
                                 "vehicles > cars" -> 1), 
                 "user 2" -> Map("vehicles > cars" -> 3)) 
val df = users.toDF("user", "categoriesRaw")

Then, you don't need a while loop to iterate over all the entries of the map: explode does exactly that for you, producing one row per key/value pair:

val explodedDf = df.select( explode('categoriesRaw) )
explodedDf.show(false)

+--------------------------------+-----+
|key                             |value|
+--------------------------------+-----+
|home & personal items > interior|1    |        
|vehicles > cars                 |1    |
|vehicles > cars                 |3    |
+--------------------------------+-----+ 

Finally, you can use groupBy and aggregate to get what you want.

explodedDf
    .select('key as "categ", 'value as "number_of_events")
    .groupBy("categ")
    .agg(count('*), sum('number_of_events))
    .show(false)

+--------------------------------+--------+---------------------+
|categ                           |count(1)|sum(number_of_events)|
+--------------------------------+--------+---------------------+
|home & personal items > interior|1       |1                    |
|vehicles > cars                 |2       |4                    |
+--------------------------------+--------+---------------------+

NB: I was not sure if you wanted to count the sessions (1st column) or the events (2nd column) so I computed both.
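
If it is specifically the number of distinct sessions per category that you need, here is a sketch that keeps the session id through the explode (assuming a session/user id column as in your schema, and spark.implicits._ already imported above):

import org.apache.spark.sql.functions.{explode, countDistinct}

// Sketch: carry the session/user id along with each exploded map entry,
// then count distinct sessions per category.
df.select('user as "sessionId", explode('categoriesRaw))
  .groupBy('key as "categ")
  .agg(countDistinct('sessionId) as "number_of_sessions")
  .show(false)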


Comments

Okay, let me explain myself. I'm trying to do the while loop because when I use explode and then select key and value, it only seems to take the first (key, value) pair, not all of them... maybe because the size of each categoriesRaw map changes from one user to another.
It would greatly help if you could provide sample data and the expected output. This way we would understand better what you are trying to do.
Input: StructField("sessionId", StringType, true), StructField("categoriesRaw", MapType(StringType, IntegerType, true), true). The map is like: user 1: [ home & personal items > interior -> 1, vehicles > cars -> 1 ], user 2: [ vehicles > cars -> 3 ]
I'd like a loop that iterates over a map as long as it is not empty (because we don't know the exact size of each map) and explodes ALL the key/value pairs of that map (because without a loop it only takes the first key/value pair).
Can you add all this information to your question? It will help others with a similar problem find a solution more easily, and people who want to help you will have all the info in one place.