
My PySpark DataFrame schema looks like this:

 |-- name: string (nullable = true)
 |-- other_attr: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- value: string (nullable = true)

I am looking for the rows that don't have [Closed, Yes] in their array of structs under other_attr. other_attr is an array of structs and may be an empty array. How can I run this filtering?

3 Answers


You can simply use array_contains to check against the struct [Closed, Yes], like so:

import pyspark.sql.functions as F

df.show()
# +-----+---------------+
# | name|     other_attr|
# +-----+---------------+
# |test1|[{Closed, Yes}]|
# |test2| [{Closed, No}]|
# |test3|             []|
# +-----+---------------+

df.where(
    ~F.array_contains('other_attr', F.struct(
        F.lit('Closed').alias('key'),
        F.lit('Yes').alias('value'),
    ))
).show()

# Output
# +-----+--------------+
# | name|    other_attr|
# +-----+--------------+
# |test2|[{Closed, No}]|
# |test3|            []|
# +-----+--------------+

3 Comments

I get this error: TypeError: 'Column' object is not callable
What does your code look like? I tested this before posting it as an answer, hence the output above. (Also, IMO, shouldn't you discuss the error before downvoting someone who spent their time answering your question?)
Sorry for the downvote! I voted by mistake, wanted to remove the vote, and this happened. Please edit the code so I can upvote you.

You can use the to_json function with contains to filter rows based on this criterion.

import pyspark.sql.functions as F

df2 = df.filter(
    ~F.to_json('other_attr').contains(
        F.to_json(
            F.struct(
                F.lit('Closed').alias('key'),
                F.lit('Yes').alias('value')
            )
        )
    )
)



It is also possible with Spark SQL: since Spark 2.4.0 you can use the higher-order function exists.

Example with SparkSQL:

SELECT exists(
    array(named_struct('key', 'a', 'value', '1'), named_struct('key', 'b', 'value', '2')),
    x -> x = named_struct('key', 'a', 'value', '1')
)

Example with PySpark:

df.filter('exists(other_attr, x -> x = named_struct("key", "a", "value", "1"))')

Note that not all the functions for manipulating arrays start with array_*, e.g. exists, filter, size, ...

