Explain PySpark first_value() Function with Examples

The first_value() function in PySpark is a window function that returns the first value of a column within a window partition, based on the specified ordering. Unlike the aggregate first() function, which returns the first element of a column or group, first_value() is used with Window specifications and works row-by-row, making it suitable for advanced analytical queries.


This function is useful in scenarios such as:

  • Fetching the first value in each partition or group.
  • Returning the earliest record in an ordered dataset.
  • Handling null values with the ignoreNulls parameter.
  • Summarizing datasets where only the first meaningful value is required.

In this article, we’ll explore:

  • What is the PySpark first_value() function
  • Syntax and parameters
  • Return value
  • Usage with and without partitions
  • Handling nulls using ignoreNulls
  • Frequently Asked Questions (FAQs)
  • Key points

Key Points

  • first_value() is a window function, unlike first() which is an aggregate function.
  • Available in the pyspark.sql.functions module.
  • Requires a Window specification to operate.
  • By default, nulls are considered, so if the first value in the ordered window is null, null is returned.
  • The ignoreNulls=True parameter skips nulls and returns the first non-null value.
  • Works both with and without partitions (via .partitionBy()).
  • Ordering must be explicitly defined using .orderBy() for deterministic results.
  • Returns null if no non-null values exist in the partition.
  • Useful in analytics to fetch the first meaningful record from partitions or datasets.
  • Runs as a distributed window operation, so it scales to large datasets when a partitioning column is defined.

PySpark first_value() Function

The first_value() function belongs to the pyspark.sql.functions module (available since PySpark 3.5.0). It returns the first value of a column in an ordered group of rows. It is often used along with Window.partitionBy() and Window.orderBy() to compute results within partitions (groups) or across the entire DataFrame.
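In practice, the call follows a simple pattern, shown in the minimal sketch below (df, groupCol, orderCol, and colName are placeholder names, not part of any API):


# General usage pattern (placeholder names)
from pyspark.sql.functions import first_value
from pyspark.sql.window import Window

windowSpec = Window.partitionBy("groupCol").orderBy("orderCol")
df = df.withColumn("first_val", first_value("colName").over(windowSpec))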

Syntax of PySpark first_value()

The syntax of the first_value() function is shown below.


# Syntax of first_value()
pyspark.sql.functions.first_value(col, ignoreNulls=None)

Parameters

  • col: The column or expression on which the function operates.
  • ignoreNulls (optional): Boolean flag that controls null handling.
    • False or None (default): Nulls are considered; if the first value in the window is null, null is returned.
    • True: Nulls are ignored and the first non-null value is returned.

Return Value

The function returns the first value in the window partition.

  • The result has the same data type as the input column.
  • If all values in the partition are null, the result is null (even if ignoreNulls=True).
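To illustrate the all-null case, here is a minimal sketch (with hypothetical data) where one group contains only nulls:


# Sketch: a partition with only nulls returns null even with ignoreNulls=True
from pyspark.sql import SparkSession
from pyspark.sql.functions import first_value
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("AllNullPartition").getOrCreate()

# Hypothetical data: group "A" has only null values
data = [("A", None), ("A", None), ("B", 10)]
df = spark.createDataFrame(data, ["group", "value"])

windowSpec = Window.partitionBy("group").orderBy("value")
df.withColumn(
    "first_non_null", first_value("value", ignoreNulls=True).over(windowSpec)
).show()
# Rows in group "A" get null; rows in group "B" get 10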

Use first_value() With Partitioning

We can select the first value from each group using the PySpark DataFrame API. In this section, we will see how to use the window function first_value() with partitionBy(). Let’s create a DataFrame and define a window specification with partitionBy() and orderBy() to get the first value of each group based on the specified ordering.


# Imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import first_value
from pyspark.sql.window import Window

# Initialize Spark session
spark = SparkSession.builder.appName("FirstValueExample").getOrCreate()

# Sample data
data = [
    ("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", None),
    ("Jen", "Finance", 3000),
    ("Jeff", "Marketing", 3000)
]

columns = ["employee_name", "department", "salary"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

# Display the DataFrame
df.show()


# Define Window
windowSpec = Window.partitionBy("department").orderBy("salary")

# Apply first_value
df.withColumn("first_salary", first_value("salary").over(windowSpec)).show()

Yields below output. Note that Spark places nulls first when sorting in ascending order, so the Finance partition's first value is null (Maria's null salary sorts ahead of Jen's 3000).


# Output (department display order may vary):
+-------------+----------+------+------------+
|employee_name|department|salary|first_salary|
+-------------+----------+------+------------+
|        Maria|   Finance|  null|        null|
|          Jen|   Finance|  3000|        null|
|         Jeff| Marketing|  3000|        3000|
|        James|     Sales|  3000|        3000|
|       Robert|     Sales|  4100|        3000|
|      Michael|     Sales|  4600|        3000|
+-------------+----------+------+------------+

A step-by-step breakdown of the code:

  • Partition the DataFrame on the department column, which groups rows with the same department together.
  • Apply orderBy() on the salary column inside each partition.
  • Add a new column by running first_value("salary") over the window.
  • For each department group, it selects the first salary based on the ordering.
  • The same first salary is assigned to all rows in that partition.

Use first_value() with ignoreNulls

We can select the first non-null value from each group by using first_value() with the ignoreNulls parameter.


# Use first_value with ignoreNulls
df.withColumn(
    "first_salary_non_null",
    first_value("salary", ignoreNulls=True).over(windowSpec)
).show()

Yields below output. With ignoreNulls=True, the Finance partition now returns 3000 instead of null.


# Output (department display order may vary):
+-------------+----------+------+---------------------+
|employee_name|department|salary|first_salary_non_null|
+-------------+----------+------+---------------------+
|        Maria|   Finance|  null|                 3000|
|          Jen|   Finance|  3000|                 3000|
|         Jeff| Marketing|  3000|                 3000|
|        James|     Sales|  3000|                 3000|
|       Robert|     Sales|  4100|                 3000|
|      Michael|     Sales|  4600|                 3000|
+-------------+----------+------+---------------------+

A step-by-step breakdown of the code:

  • Partition the DataFrame by the department column.
  • Apply orderBy() on the salary column.
  • Normally, if the first row in the order has a null salary, then the result is null for all rows in that partition.
  • By enabling ignoreNulls=True, the function skips null values and picks the next available non-null salary.
  • Each department gets the first non-null salary instead of returning nulls.

Use first_value() Without Partitioning

We can also use first_value() without partitioning; in this case, it returns the first value from the entire DataFrame after ordering. Keep in mind that without partitionBy(), Spark moves all rows into a single partition to evaluate the window (and logs a warning), which can be expensive on large datasets.


# Define Window WITHOUT partitioning
windowSpec = Window.orderBy("salary")

# Apply first_value
df.withColumn("first_salary_global", first_value("salary").over(windowSpec)).show()

Yields below output. Because ascending order places nulls first, Maria's null salary is the global first value, so first_salary_global is null for every row. To get the first non-null salary (3000) instead, pass ignoreNulls=True or order with asc_nulls_last.


# Output:
+-------------+----------+------+-------------------+
|employee_name|department|salary|first_salary_global|
+-------------+----------+------+-------------------+
|        Maria|   Finance|  null|               null|
|        James|     Sales|  3000|               null|
|          Jen|   Finance|  3000|               null|
|         Jeff| Marketing|  3000|               null|
|       Robert|     Sales|  4100|               null|
|      Michael|     Sales|  4600|               null|
+-------------+----------+------+-------------------+

A step-by-step breakdown of the code:

  • No partitioning is applied; the entire DataFrame is treated as one group.
  • Apply orderBy() on the salary column across the DataFrame.
  • The first value in the global order is selected; here that is Maria's null salary, since nulls sort first in ascending order.
  • That value is then assigned to all rows in the DataFrame.

Difference Between first() and first_value()

The following table shows the difference between first() and first_value().

| Feature | first() | first_value() |
| --- | --- | --- |
| Type | Aggregate function | Window function |
| Module | pyspark.sql.functions | pyspark.sql.functions |
| Usage | Used with groupBy() or as an aggregate to fetch the first element of a column or group | Used with Window specifications to get the first value within an ordered partition |
| Ordering | Does not guarantee order unless combined with .orderBy() | Requires .orderBy() for deterministic results |
| Null Handling | Supports the ignorenulls parameter | Supports the ignoreNulls parameter |
| Scope | Operates on entire DataFrame or groups | Operates row-by-row within partitions or globally with Window |
| Return Type | Same data type as input column | Same data type as input column |
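To make the table concrete, the sketch below contrasts the two calls, reusing the df created earlier. Note the differing keyword spellings: first() takes ignorenulls (lowercase), while first_value() takes ignoreNulls.


# Sketch: first() vs first_value(), reusing df from the examples above
from pyspark.sql.functions import first, first_value
from pyspark.sql.window import Window

# first(): aggregate function, one row per group
df.groupBy("department") \
  .agg(first("salary", ignorenulls=True).alias("first_salary")) \
  .show()

# first_value(): window function, value repeated on every row of the group
windowSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn(
    "first_salary", first_value("salary", ignoreNulls=True).over(windowSpec)
).show()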

Frequently Asked Questions on the PySpark first_value() Function

What is the difference between first() and first_value() in PySpark?

first() is an aggregate function that returns the first element of a group.
first_value() is a window function that returns the first value in a window or partition after applying ordering.

How does first_value() handle null values?

By default, if the first row contains a null, first_value() will return null. You can set ignoreNulls=True to skip nulls and return the first available non-null value.

How can I use first_value() without partitioning?

You can define a window with only orderBy() (without partitionBy()), and first_value() will return the first value across the entire dataset.

What is the return type of first_value()?

The return type is the same as the input column’s data type. For example, if the column is of type Integer, the result will also be Integer.

When should I use first_value() in PySpark?

You should use first_value() when you need the first ordered value within a window or partition (e.g., first salary in each department, first transaction per customer).
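For instance, here is a minimal sketch (with hypothetical data) that fetches each customer's first transaction amount by date:


# Sketch: first transaction amount per customer (hypothetical data)
from pyspark.sql import SparkSession
from pyspark.sql.functions import first_value
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("FirstTxnExample").getOrCreate()

txns = spark.createDataFrame(
    [
        ("c1", "2024-01-05", 120.0),
        ("c1", "2024-02-10", 80.0),
        ("c2", "2024-01-20", 50.0),
    ],
    ["customer", "txn_date", "amount"],
)

windowSpec = Window.partitionBy("customer").orderBy("txn_date")
txns.withColumn("first_amount", first_value("amount").over(windowSpec)).show()
# c1 rows get 120.0 (earliest date); c2 rows get 50.0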

Conclusion

In this article, we explored the PySpark first_value() function, its syntax, parameters, return type, and how it differs from the aggregate first() function. We also discussed how to use it with partitions, handle nulls with ignorenulls, and common FAQs.

By combining first_value() with Window specifications and ordering, you can retrieve the first meaningful value per partition or across datasets — a powerful tool for analytics and reporting in PySpark pipelines.

Happy Learning!!
