0

I want to implement an functionality to remove an element from an array of struct in spark scala.For the date "2019-01-26" I want to remove the entire struct from the array column. Following is my code :

  import org.apache.spark.sql.types._
    
    val df=Seq(("123","Jack",Seq(("2020-04-26","200","72","ABC"),("2020-05-26","300","71","ABC"),("2019-01-26","200","70","DEF"),("2019-01-26","200","70","DEF"),("2019-01-26","200","70","DEF"))),("124","jones",Seq(("2020-04-26","200","72","ABC"),("2020-05-26","300","71","ABC"),("2020-06-26","200","70","ABC"),("2020-08-26","300","69","ABC"),("2020-08-26","300","69","ABC"))),("125","daniel",Seq(("2019-01-26","200","70","DEF"),("2019-01-26","200","70","DEF"),("2019-01-26","200","70","DEF"),("2019-01-26","200","70","DEF"),("2019-01-26","200","70","DEF")))).toDF("id","name","history").withColumn("history",$"history".cast("array<struct<infodate:Date,amount1:Integer,amount2:Integer,detail:string>>"))
    
    scala> df.printSchema
    root
     |-- id: string (nullable = true)
     |-- name: string (nullable = true)
     |-- history: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- infodate: date (nullable = true)
     |    |    |-- amount1: integer (nullable = true)
     |    |    |-- amount2: integer (nullable = true)
     |    |    |-- detail: string (nullable = true)

enter image description here

So the for the date 2019-01-26 , I want to remove the struct in which it is present so that it is removed from the array column.I want a solution like this.

enter image description here

I manage to find the solution but it involves lot of hardcoding and I'm searching for a solution/suggestion that is optimal.

Hardcoded solution:

val dfnew=df
 .withColumn( "history" , 
  array_except(
   col("history"),
   array(
    struct(
     lit("2019-01-26").cast(DataTypes.DateType).alias("infodate"),
     lit("200").cast(DataTypes.IntegerType).alias("amount1"), 
     lit("70").cast(DataTypes.IntegerType).alias("amount2"), 
     lit("DEF").alias("detail")
    )
   )
  )
 )

Is there any way of optimally doing it with one filter condition only on date "2019-01-26", which removes the struct/array from the array column.

1 Answer 1

1

I use an expression / filter here. Obviosuly it's a string so you can replace the date with a value so that there is even less hard coding. Filters are handy expressions as they let you use SQL notation to reference sub-components of the struct.

scala> :paste
// Entering paste mode (ctrl-D to finish)

df.withColumn( "history" , 
  expr( "filter( history , x -> x.infodate != '2019-01-26' )" )
 ).show(10,false)

// Exiting paste mode, now interpreting.

+---+------+--------------------------------------------------------------------------------------------------------------------------------------------+
|id |name  |history                                                                                                                                     |
+---+------+--------------------------------------------------------------------------------------------------------------------------------------------+
|123|Jack  |[[2020-04-26, 200, 72, ABC], [2020-05-26, 300, 71, ABC]]                                                                                    |
|124|jones |[[2020-04-26, 200, 72, ABC], [2020-05-26, 300, 71, ABC], [2020-06-26, 200, 70, ABC], [2020-08-26, 300, 69, ABC], [2020-08-26, 300, 69, ABC]]|
|125|daniel|[]                                                                                                                                          |
+---+------+--------------------------------------------------------------------------------------------------------------------------------------------+
Sign up to request clarification or add additional context in comments.

2 Comments

Thank you @matt-andruff , is there any way we can filter out two dates instead of one using this filter expression, two dates - "2020-04-26" and "2019-01-26" . I tried out this code expr( "filter( (history , x -> x.infodate != '2019-01-26') & (history , x -> x.infodate != '2020-04-26') )" ) but didnt work.
I would think: expr( "filter( (history , x -> x.infodate != '2019-01-26' or x.infodate != '2020-04-26') )"

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.