
My PySpark DataFrame schema looks like this:

 |-- name: string (nullable = true)
 |-- other_attr: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- value: string (nullable = true)

I am looking for the rows that don't have [Closed, Yes] in their array of structs under other_attr. other_attr is an array of structs and may be an empty array. How can I run this filtering?

3 Answers


You can simply use array_contains to check against the struct [Closed, Yes], like so:

import pyspark.sql.functions as F

df.show()
# +-----+---------------+
# | name|     other_attr|
# +-----+---------------+
# |test1|[{Closed, Yes}]|
# |test2| [{Closed, No}]|
# |test3|             []|
# +-----+---------------+

df.where(
    ~F.array_contains('other_attr', F.struct(
        F.lit('Closed').alias('key'),
        F.lit('Yes').alias('value'),
    ))
).show()

# Output
# +-----+--------------+
# | name|    other_attr|
# +-----+--------------+
# |test2|[{Closed, No}]|
# |test3|            []|
# +-----+--------------+

3 Comments

I get this error: TypeError: 'Column' object is not callable
What does your code look like? I tested this before posting it as an answer, hence the output above. (Also, IMO, shouldn't you discuss the error before downvoting someone who spent their time answering your question?)
Sorry for the downvote! I voted by mistake, wanted to remove the vote, and this happened. Please edit the code so I can upvote you.

You can use the to_json function with contains to filter rows based on this criterion.

import pyspark.sql.functions as F

df2 = df.filter(
    ~F.to_json('other_attr').contains(
        F.to_json(
            F.struct(
                F.lit('Closed').alias('key'),
                F.lit('Yes').alias('value')
            )
        )
    )
)



It is also possible with Spark SQL: since Spark 2.4.0 you can use the higher-order function exists.

Example with SparkSQL:

SELECT exists(
    array(named_struct('key', 'a', 'value', '1'), named_struct('key', 'b', 'value', '2')),
    x -> x = named_struct('key', 'a', 'value', '1')
)

Example with PySpark:

df.filter('exists(other_attr, x -> x = named_struct("key", "a", "value", "1"))')

Note that not all the functions for manipulating arrays start with array_*, e.g. exists, filter, size, ...

