
I have a col in a dataframe which is an array of structs. There are some structs with all null values which I would like to filter out. For example with the following dataframe:

+-----------+------------------------------------------+
|advertiser |brands                                    |
+-----------+------------------------------------------+
|Advertiser1|[{"id": "a", "type": "b", "name": "c"}]   |
|Advertiser2|[{"id": null, "type": null, "name": null}]|
+-----------+------------------------------------------+

I would like to filter out the struct with the null values to get:

+-----------+---------------------------------------+
|advertiser |brands                                 |
+-----------+---------------------------------------+
|Advertiser1|[{"id": "a", "type": "b", "name": "c"}]|
|Advertiser2|[]                                     |
+-----------+---------------------------------------+

I'm thinking it's something along the lines of this if I can come up with a struct of null values:

.withColumn(
    "brands",
    when(
        col("brands").equalTo(*emptyStruct?*),
            null
    )
)

2 Answers


You can try the to_json function: a brands array whose struct has all null values will serialize to [{}].
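For example, a minimal sketch of that idea, assuming Spark 3.0+ for the filter higher-order function and that to_json omits null fields so an all-null struct serializes to "{}":

import org.apache.spark.sql.functions._

// Keep only the structs whose JSON form is not "{}", i.e. drop all-null entries.
val result = df.withColumn(
  "brands",
  filter(col("brands"), b => to_json(b) =!= lit("{}"))
)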




You want to filter the array elements. Since Spark 3.0 there is a method in org.apache.spark.sql.functions with the signature:

filter(column: Column, f: Column => Column): Column

which does this for you:

df.select(
  col("advertiser"),
  filter(col("brands"),
         b =>
           b.getField("id").isNotNull
             && b.getField("type").isNotNull
             && b.getField("name").isNotNull) as "brands")

Using Spark 1.6+ you can explode the array, filter the structs and then group by advertiser:

df.select(col("advertiser"), explode(col("brands")) as "b")
  .filter(
    col("b.id").isNotNull &&
      col("b.type").isNotNull &&
      col("b.name").isNotNull)
  .groupBy("advertiser")
  .agg(collect_list("b") as "brands")

2 Comments

Hmm that syntax doesn't seem to work for me. I am getting error: not found: value filter. Should it be col("brands").filter() maybe?
It might be the Spark version you are using, as the error implies the function cannot be found. The answer I posted works on Spark 3.+.

