PySpark array_contains() Function with Examples

The PySpark array_contains() function is a SQL collection function that returns a boolean value indicating if an array-type column contains a specified element. It returns null if the array itself is null, true if the element exists, and false otherwise. This function can be applied to create a new boolean column or to filter rows in a DataFrame.


In this article, I will explain how to use the array_contains() function with different examples, including single values, multiple values, NULL checks, filtering, and joins.

Key Points:

  • Function Purpose – array_contains() checks if a given element exists in an array column and returns a boolean result.
  • Return Values – Returns true if the element is found, false if not, and null if the array itself is null.
  • Column Type Restriction – Works only with ArrayType columns in a DataFrame.
  • Case Sensitivity – The function is case-sensitive, so "Java" and "java" are treated as different values.
  • Syntax – The syntax is array_contains(col, value), where col is the array column and value is the element to check.
  • Create New Columns – Can be used inside withColumn() to add boolean columns that indicate the presence of values in arrays.
  • Filtering Rows – Can be applied inside filter() to select rows where an array column contains a specific element.
  • Multiple Values – You cannot check multiple values in one call, but you can combine multiple array_contains() conditions with OR (|) and AND (&).
  • Handling NULLs – To check if an array contains NULL, you can use expr() with exists().
  • Usage in Joins – array_contains() can also be used in join conditions to connect DataFrames based on array values.

PySpark array_contains

The array_contains() function in PySpark is used to check whether a specific element exists in an array column. It returns a Boolean (True or False) for each row. This function is case-sensitive and works with ArrayType columns in a DataFrame.

Syntax of PySpark array_contains

The following is the syntax of the PySpark array_contains().


# syntax of the PySpark array_contains()
pyspark.sql.functions.array_contains(col, value)

Parameters

  • col: The name of the column containing the arrays.
  • value: The value or column to check for in the array.

Return Value

It returns a new Boolean column where each value indicates whether the corresponding array from the specified column contains the specified value.
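
To see all three outcomes in one place, here is a minimal, self-contained sketch; the demo DataFrame and its langs column are made up for illustration.


# Minimal sketch: true, false, and null results from array_contains()
from pyspark.sql import SparkSession
from pyspark.sql.functions import array_contains

spark = SparkSession.builder.appName("ArrayContainsReturns").getOrCreate()

# Three rows: array containing the value, array without it, and a NULL array
demo = spark.createDataFrame(
    [(["Java", "Scala"],), (["C++"],), (None,)],
    "langs: array<string>"
)
demo.select(array_contains("langs", "Java").alias("has_java")).show()
# Row 1 -> true, Row 2 -> false, Row 3 -> null (the array itself is null)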

Sample DataFrame

We’ll use the following DataFrame for all examples.


# Create DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, ArrayType, StructType, StructField

# Initialize Spark session
spark = SparkSession.builder.appName("ArrayContainsExample").getOrCreate()

# Sample data
data = [
    ("James,,Smith", ["Java", "Scala", "C++"], ["Spark", "Java"], "OH", "CA"),
    ("Michael,Rose,", ["Spark", "Java", "C++"], ["Spark", "Java"], "NY", "NJ"),
    ("Robert,,Williams", ["CSharp", "VB"], ["Spark", "Python"], "UT", "NV")
]

# Schema
schema = StructType([
    StructField("name", StringType(), True),
    StructField("languagesAtSchool", ArrayType(StringType()), True),
    StructField("languagesAtWork", ArrayType(StringType()), True),
    StructField("currentState", StringType(), True),
    StructField("previousState", StringType(), True)
])

# Create DataFrame
df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show(truncate=False)

Yields below the output.


# Output:
root
 |-- name: string (nullable = true)
 |-- languagesAtSchool: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- languagesAtWork: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- currentState: string (nullable = true)
 |-- previousState: string (nullable = true)

+----------------+------------------+---------------+------------+-------------+
|name            |languagesAtSchool |languagesAtWork|currentState|previousState|
+----------------+------------------+---------------+------------+-------------+
|James,,Smith    |[Java, Scala, C++]|[Spark, Java]  |OH          |CA           |
|Michael,Rose,   |[Spark, Java, C++]|[Spark, Java]  |NY          |NJ           |
|Robert,,Williams|[CSharp, VB]      |[Spark, Python]|UT          |NV           |
+----------------+------------------+---------------+------------+-------------+

PySpark Array Contains String

You can use the array_contains() function to check whether a specific value exists in an array. If the value is present, it returns true; otherwise, it returns false. By passing the array column and the target value, it creates a new DataFrame with a boolean column indicating the result.


# Check whether the array contains a value
from pyspark.sql.functions import array_contains

df.select("name", array_contains("languagesAtSchool", "Java")).show()

Yields below the output.


# Output:
+----------------+---------------------------------------+
|            name|array_contains(languagesAtSchool, Java)|
+----------------+---------------------------------------+
|    James,,Smith|                                   true|
|   Michael,Rose,|                                   true|
|Robert,,Williams|                                  false|
+----------------+---------------------------------------+

PySpark array_contains Multiple Values

You can check for multiple values in an array column by combining multiple array_contains() conditions using logical operators such as OR (|) or AND (&).


# Check multiple values 
df.select(
    "name",
    (
        array_contains("languagesAtSchool", "Java") |
        array_contains("languagesAtSchool", "Scala")
    )
).show()

Yields below the output.


# Output:
+----------------+--------------------------------------------------------------------------------------+
|            name|(array_contains(languagesAtSchool, Java) OR array_contains(languagesAtSchool, Scala))|
+----------------+--------------------------------------------------------------------------------------+
|    James,,Smith|                                                                                  true|
|   Michael,Rose,|                                                                                  true|
|Robert,,Williams|                                                                                 false|
+----------------+--------------------------------------------------------------------------------------+
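
If you need to test several candidate values in a single call, Spark's built-in arrays_overlap() function (available since Spark 2.4) returns true when two arrays share at least one element. A sketch of the same "Java or Scala" check:


# Alternative sketch: arrays_overlap() tests several values in one call
from pyspark.sql.functions import arrays_overlap, array, lit

df.select(
    "name",
    arrays_overlap(
        "languagesAtSchool",
        array(lit("Java"), lit("Scala"))  # values to look for
    ).alias("java_or_scala")
).show()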

PySpark Array Contains Null

To check if an array column has a NULL element, you can use the expr() function together with the exists() function.


# Check whether an array contains null values
from pyspark.sql.functions import expr
data_with_null = [
    ("John", ["Java", None], ["Spark"], "TX", "CA")
]

df_null = spark.createDataFrame(data_with_null, schema=schema)

df_null.select(
    "name",
    expr("exists(languagesAtSchool, x -> x IS NULL)")).show()

Yields below the output.


# Output:
+----+-------------------------------------------------------------------------------------------------+
|name|exists(languagesAtSchool, lambdafunction((namedlambdavariable() IS NULL), namedlambdavariable()))|
+----+-------------------------------------------------------------------------------------------------+
|John|                                                                                             true|
+----+-------------------------------------------------------------------------------------------------+
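
The same exists() expression also works inside filter() if you only want to keep the rows whose arrays contain a NULL element:


# Keep only rows whose array contains a NULL element
df_null.filter(expr("exists(languagesAtSchool, x -> x IS NULL)")).show()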

PySpark Array Contains Filter

You can filter DataFrame rows based on whether an array-type column contains a specified value by using the array_contains() function directly inside the filter() method.


# Filter DataFrame using array_contains()
df.filter(array_contains("languagesAtSchool", "Java")).show()

Yields below the output.


# Output:
+-------------+------------------+---------------+------------+-------------+
|         name| languagesAtSchool|languagesAtWork|currentState|previousState|
+-------------+------------------+---------------+------------+-------------+
| James,,Smith|[Java, Scala, C++]|  [Spark, Java]|          OH|           CA|
|Michael,Rose,|[Spark, Java, C++]|  [Spark, Java]|          NY|           NJ|
+-------------+------------------+---------------+------------+-------------+

Related: You can also filter the DataFrame using the contains() function.
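
You can also negate the condition with the ~ operator to keep only the rows where the array does not contain the value:


# Filter rows where the array does NOT contain "Java"
df.filter(~array_contains("languagesAtSchool", "Java")).show(truncate=False)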

PySpark Array Contains Join

You can also use array_contains() in join conditions to connect two DataFrames based on whether an array column contains a given value. Note that in the example below the condition references only the left DataFrame, so every row of df whose array contains "Java" is paired with every row of df2, which amounts to a filtered cross join.


# Join DataFrames using array_contains()
df2 = df.select("currentState").distinct()

joined_df = df.join(df2, array_contains("languagesAtSchool", "Java"))
joined_df.show()

Yields below the output.


# Output:
+-------------+------------------+---------------+------------+-------------+------------+
|         name| languagesAtSchool|languagesAtWork|currentState|previousState|currentState|
+-------------+------------------+---------------+------------+-------------+------------+
| James,,Smith|[Java, Scala, C++]|  [Spark, Java]|          OH|           CA|          OH|
| James,,Smith|[Java, Scala, C++]|  [Spark, Java]|          OH|           CA|          NY|
| James,,Smith|[Java, Scala, C++]|  [Spark, Java]|          OH|           CA|          UT|
|Michael,Rose,|[Spark, Java, C++]|  [Spark, Java]|          NY|           NJ|          OH|
|Michael,Rose,|[Spark, Java, C++]|  [Spark, Java]|          NY|           NJ|          NY|
|Michael,Rose,|[Spark, Java, C++]|  [Spark, Java]|          NY|           NJ|          UT|
+-------------+------------------+---------------+------------+-------------+------------+
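
A condition that references both sides makes for a more typical join. The sketch below uses a made-up df3 lookup DataFrame and passes its language column as the value argument, which, as noted in the Parameters section, recent Spark versions support:


# Sketch: join on whether the left array contains the right column's value
# (df3 and its "language" column are hypothetical)
df3 = spark.createDataFrame([("Java",), ("Python",)], ["language"])

df.join(df3, array_contains(df.languagesAtSchool, df3.language)) \
    .select("name", "language").show()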

Case-sensitive example with array_contains

You can use withColumn() with array_contains() to add new boolean columns that verify whether an array contains specific values, while also demonstrating its case sensitivity.


# Case-sensitive example with array_contains
df_result = df.withColumn(
    "has_java_school",
    array_contains(df.languagesAtSchool, "Java")
).withColumn(
    "has_python_work",
    array_contains(df.languagesAtWork, "python")  # lowercase 'python'
)

df_result.show(truncate=False)

Yields below the output.


# Output:
+----------------+------------------+---------------+------------+-------------+---------------+---------------+
|name            |languagesAtSchool |languagesAtWork|currentState|previousState|has_java_school|has_python_work|
+----------------+------------------+---------------+------------+-------------+---------------+---------------+
|James,,Smith    |[Java, Scala, C++]|[Spark, Java]  |OH          |CA           |true           |false          |
|Michael,Rose,   |[Spark, Java, C++]|[Spark, Java]  |NY          |NJ           |true           |false          |
|Robert,,Williams|[CSharp, VB]      |[Spark, Python]|UT          |NV           |false          |false          |
+----------------+------------------+---------------+------------+-------------+---------------+---------------+

Since array_contains() is case-sensitive, "Java" matches successfully, while lowercase "python" does not match the "Python" entry in Robert's languagesAtWork array.
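
array_contains() has no case-insensitive mode, but one workaround is to normalize the elements inside an exists() expression. A sketch that lower-cases each element before comparing:


# Sketch: case-insensitive membership test using exists() and lower()
from pyspark.sql.functions import expr

df.select(
    "name",
    expr("exists(languagesAtWork, x -> lower(x) = 'python')").alias("has_python")
).show()
# Robert's [Spark, Python] now matches, even though the query uses lowercase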

Frequently Asked Questions about PySpark array_contains

What does array_contains() do in PySpark?

It checks whether a given value exists in an array column and returns true or false.

How can I check multiple values with array_contains()?

You cannot check multiple values in a single array_contains() call. Instead, you can combine multiple array_contains() expressions using | (OR) or & (AND).

How do I check for NULL values inside arrays?

Use expr("exists(arrayCol, x -> x IS NULL)") to check for NULL values inside arrays.

How can I use array_contains() in joins?

You can use array_contains() inside the join condition to match rows based on whether an array column contains a specific value.

Conclusion

In this article, I have explained the array_contains() function in PySpark, which is a powerful tool for working with array columns. It allows you to perform a variety of operations such as checking for the presence of values, filtering rows, handling NULL entries, and even using it in joins with other DataFrames. This makes it highly flexible for manipulating and querying array-based data.

By combining array_contains() with logical operators (|, &) and functions like exists(), you can extend its capabilities to handle more complex scenarios commonly encountered in data engineering workflows.

Happy Learning!!

References

https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.array_contains.html