PySpark array_contains() Function with Examples

The PySpark array_contains() function is a SQL collection function that returns a boolean value indicating if an array-type column contains a specified element. It returns null if the array itself is null, true if the element exists, and false otherwise. This function can be applied to create a new boolean column or to filter rows in a DataFrame.


In this article, I will explain how to use the array_contains() function with different examples, including single values, multiple values, NULL checks, filtering, and joins.

Key Points:

  • Function Purpose – array_contains() checks if a given element exists in an array column and returns a boolean result.
  • Return Values – Returns true if the element is found, false if not, and null if the array itself is null.
  • Column Type Restriction – Works only with ArrayType columns in a DataFrame.
  • Case Sensitivity – The function is case-sensitive, so "Java" and "java" are treated as different values.
  • Syntax – The syntax is array_contains(col, value), where col is the array column and value is the element to check.
  • Create New Columns – Can be used inside withColumn() to add boolean columns that indicate the presence of values in arrays.
  • Filtering Rows – Can be applied inside filter() to select rows where an array column contains a specific element.
  • Multiple Values – You cannot check multiple values in one call, but you can combine multiple array_contains() conditions with OR (|) and AND (&).
  • Handling NULLs – To check if an array contains NULL, you can use expr() with exists().
  • Usage in Joins – array_contains() can also be used in join conditions to connect DataFrames based on array values.

PySpark array_contains

The array_contains() function in PySpark is used to check whether a specific element exists in an array column. It returns a Boolean (True or False) for each row. This function is case-sensitive and works with ArrayType columns in a DataFrame.

Syntax of PySpark array_contains

The following is the syntax of the PySpark array_contains().


# syntax of the PySpark array_contains()
pyspark.sql.functions.array_contains(col, value)

Parameters

  • col: The name of the column containing the arrays.
  • value: The value or column to check for in the array.

Return Value

It returns a new Boolean column where each value indicates whether the corresponding array from the specified column contains the specified value.
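
To see all three outcomes in one place, here is a minimal, self-contained sketch; the demo DataFrame and its langs column are made up for illustration.


# Minimal sketch: true, false, and null results from array_contains()
from pyspark.sql import SparkSession
from pyspark.sql.functions import array_contains

spark = SparkSession.builder.appName("ArrayContainsReturns").getOrCreate()

# Three rows: array containing the value, array without it, and a NULL array
demo = spark.createDataFrame(
    [(["Java", "Scala"],), (["C++"],), (None,)],
    "langs: array<string>"
)
demo.select(array_contains("langs", "Java").alias("has_java")).show()
# Row 1 -> true, Row 2 -> false, Row 3 -> null (the array itself is null)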

Sample DataFrame

We’ll use the following DataFrame for all examples.


# Create DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, ArrayType, StructType, StructField

# Initialize Spark session
spark = SparkSession.builder.appName("ArrayContainsExample").getOrCreate()

# Sample data
data = [
    ("James,,Smith", ["Java", "Scala", "C++"], ["Spark", "Java"], "OH", "CA"),
    ("Michael,Rose,", ["Spark", "Java", "C++"], ["Spark", "Java"], "NY", "NJ"),
    ("Robert,,Williams", ["CSharp", "VB"], ["Spark", "Python"], "UT", "NV")
]

# Schema
schema = StructType([
    StructField("name", StringType(), True),
    StructField("languagesAtSchool", ArrayType(StringType()), True),
    StructField("languagesAtWork", ArrayType(StringType()), True),
    StructField("currentState", StringType(), True),
    StructField("previousState", StringType(), True)
])

# Create DataFrame
df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show(truncate=False)

Yields below the output.


# Output:
root
 |-- name: string (nullable = true)
 |-- languagesAtSchool: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- languagesAtWork: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- currentState: string (nullable = true)
 |-- previousState: string (nullable = true)

+----------------+------------------+---------------+------------+-------------+
|name            |languagesAtSchool |languagesAtWork|currentState|previousState|
+----------------+------------------+---------------+------------+-------------+
|James,,Smith    |[Java, Scala, C++]|[Spark, Java]  |OH          |CA           |
|Michael,Rose,   |[Spark, Java, C++]|[Spark, Java]  |NY          |NJ           |
|Robert,,Williams|[CSharp, VB]      |[Spark, Python]|UT          |NV           |
+----------------+------------------+---------------+------------+-------------+

PySpark Array Contains String

You can use the array_contains() function to check whether a specific value exists in an array. If the value is present, it returns true; otherwise, it returns false. By passing the array column and the target value, it creates a new DataFrame with a boolean column indicating the result.


# Check whether the array contains a value
from pyspark.sql.functions import array_contains

df.select("name", array_contains("languagesAtSchool", "Java")).show()

Yields below the output.


# Output:
+----------------+---------------------------------------+
|            name|array_contains(languagesAtSchool, Java)|
+----------------+---------------------------------------+
|    James,,Smith|                                   true|
|   Michael,Rose,|                                   true|
|Robert,,Williams|                                  false|
+----------------+---------------------------------------+

PySpark array_contains Multiple Values

You can check for multiple values in an array column by combining multiple array_contains() conditions using logical operators such as OR (|) or AND (&).


# Check multiple values 
df.select(
    "name",
    (
        array_contains("languagesAtSchool", "Java") |
        array_contains("languagesAtSchool", "Scala")
    )
).show()

Yields below the output.


# Output:
+----------------+--------------------------------------------------------------------------------------+
|            name|(array_contains(languagesAtSchool, Java) OR array_contains(languagesAtSchool, Scala))|
+----------------+--------------------------------------------------------------------------------------+
|    James,,Smith|                                                                                  true|
|   Michael,Rose,|                                                                                  true|
|Robert,,Williams|                                                                                 false|
+----------------+--------------------------------------------------------------------------------------+
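
If you need to test several candidate values in a single call, Spark's built-in arrays_overlap() function (available since Spark 2.4) returns true when two arrays share at least one element. A sketch of the same "Java or Scala" check:


# Alternative sketch: arrays_overlap() tests several values in one call
from pyspark.sql.functions import arrays_overlap, array, lit

df.select(
    "name",
    arrays_overlap(
        "languagesAtSchool",
        array(lit("Java"), lit("Scala"))  # values to look for
    ).alias("java_or_scala")
).show()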

PySpark Array Contains Null

To check if an array column has a NULL element, you can use the expr() function together with the exists() function.


# Check whether an array contains null values
from pyspark.sql.functions import expr
data_with_null = [
    ("John", ["Java", None], ["Spark"], "TX", "CA")
]

df_null = spark.createDataFrame(data_with_null, schema=schema)

df_null.select(
    "name",
    expr("exists(languagesAtSchool, x -> x IS NULL)")).show()

Yields below the output.


# Output:
+----+-------------------------------------------------------------------------------------------------+
|name|exists(languagesAtSchool, lambdafunction((namedlambdavariable() IS NULL), namedlambdavariable()))|
+----+-------------------------------------------------------------------------------------------------+
|John|                                                                                             true|
+----+-------------------------------------------------------------------------------------------------+
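
The same exists() expression also works inside filter() if you only want to keep the rows whose arrays contain a NULL element:


# Keep only rows whose array contains a NULL element
df_null.filter(expr("exists(languagesAtSchool, x -> x IS NULL)")).show()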

PySpark Array Contains Filter

You can filter DataFrame rows based on whether an array-type column contains a specified value by using the array_contains() function directly inside the filter() method.


# Filter DataFrame using array_contains()
df.filter(array_contains("languagesAtSchool", "Java")).show()

Yields below the output.


# Output:
+-------------+------------------+---------------+------------+-------------+
|         name| languagesAtSchool|languagesAtWork|currentState|previousState|
+-------------+------------------+---------------+------------+-------------+
| James,,Smith|[Java, Scala, C++]|  [Spark, Java]|          OH|           CA|
|Michael,Rose,|[Spark, Java, C++]|  [Spark, Java]|          NY|           NJ|
+-------------+------------------+---------------+------------+-------------+

Related: You can also filter the DataFrame using the contains() function.
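
You can also negate the condition with the ~ operator to keep only the rows where the array does not contain the value:


# Filter rows where the array does NOT contain "Java"
df.filter(~array_contains("languagesAtSchool", "Java")).show(truncate=False)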

PySpark Array Contains Join

You can also use array_contains() in join conditions to connect two DataFrames based on whether an array column contains a given value. Note that in the example below the condition references only the left DataFrame, so every row of df whose array contains "Java" is paired with every row of df2, which amounts to a filtered cross join.


# Join DataFrames using array_contains()
df2 = df.select("currentState").distinct()

joined_df = df.join(df2, array_contains("languagesAtSchool", "Java"))
joined_df.show()

Yields below the output.


# Output:
+-------------+------------------+---------------+------------+-------------+------------+
|         name| languagesAtSchool|languagesAtWork|currentState|previousState|currentState|
+-------------+------------------+---------------+------------+-------------+------------+
| James,,Smith|[Java, Scala, C++]|  [Spark, Java]|          OH|           CA|          OH|
| James,,Smith|[Java, Scala, C++]|  [Spark, Java]|          OH|           CA|          NY|
| James,,Smith|[Java, Scala, C++]|  [Spark, Java]|          OH|           CA|          UT|
|Michael,Rose,|[Spark, Java, C++]|  [Spark, Java]|          NY|           NJ|          OH|
|Michael,Rose,|[Spark, Java, C++]|  [Spark, Java]|          NY|           NJ|          NY|
|Michael,Rose,|[Spark, Java, C++]|  [Spark, Java]|          NY|           NJ|          UT|
+-------------+------------------+---------------+------------+-------------+------------+
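
A condition that references both sides makes for a more typical join. The sketch below uses a made-up df3 lookup DataFrame and passes its language column as the value argument, which, as noted in the Parameters section, recent Spark versions support:


# Sketch: join on whether the left array contains the right column's value
# (df3 and its "language" column are hypothetical)
df3 = spark.createDataFrame([("Java",), ("Python",)], ["language"])

df.join(df3, array_contains(df.languagesAtSchool, df3.language)) \
    .select("name", "language").show()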

Case-sensitive example with array_contains

You can use withColumn() with array_contains() to add new boolean columns that verify whether an array contains specific values, while also demonstrating its case sensitivity.


# Case-sensitive example with array_contains
df_result = df.withColumn(
    "has_java_school",
    array_contains(df.languagesAtSchool, "Java")
).withColumn(
    "has_python_work",
    array_contains(df.languagesAtWork, "python")  # lowercase 'python'
)

df_result.show(truncate=False)

Yields below the output.


# Output:
+----------------+------------------+---------------+------------+-------------+---------------+---------------+
|name            |languagesAtSchool |languagesAtWork|currentState|previousState|has_java_school|has_python_work|
+----------------+------------------+---------------+------------+-------------+---------------+---------------+
|James,,Smith    |[Java, Scala, C++]|[Spark, Java]  |OH          |CA           |true           |false          |
|Michael,Rose,   |[Spark, Java, C++]|[Spark, Java]  |NY          |NJ           |true           |false          |
|Robert,,Williams|[CSharp, VB]      |[Spark, Python]|UT          |NV           |false          |false          |
+----------------+------------------+---------------+------------+-------------+---------------+---------------+

Since array_contains() is case-sensitive, "Java" matches successfully, while lowercase "python" does not match the "Python" entry in Robert's languagesAtWork array.
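
array_contains() has no case-insensitive mode, but one workaround is to normalize the elements inside an exists() expression. A sketch that lower-cases each element before comparing:


# Sketch: case-insensitive membership test using exists() and lower()
from pyspark.sql.functions import expr

df.select(
    "name",
    expr("exists(languagesAtWork, x -> lower(x) = 'python')").alias("has_python")
).show()
# Robert's [Spark, Python] now matches, even though the query uses lowercase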

Frequently Asked Questions about PySpark array_contains

What does array_contains() do in PySpark?

It checks whether a given value exists in an array column and returns true or false.

How can I check multiple values with array_contains()?

You cannot check multiple values in a single array_contains() call. Instead, you can combine multiple array_contains() expressions using | (OR) or & (AND).

How do I check for NULL values inside arrays?

Use expr("exists(arrayCol, x -> x IS NULL)") to check for NULL values inside arrays.

How can I use array_contains() in joins?

You can use array_contains() inside the join condition to match rows based on whether an array column contains a specific value.

Conclusion

In this article, I have explained the array_contains() function in PySpark, which is a powerful tool for working with array columns. It allows you to perform a variety of operations such as checking for the presence of values, filtering rows, handling NULL entries, and even using it in joins with other DataFrames. This makes it highly flexible for manipulating and querying array-based data.

By combining array_contains() with logical operators (|, &) and functions like exists(), you can extend its capabilities to handle more complex scenarios commonly encountered in data engineering workflows.

Happy Learning!!

References

https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.array_contains.html