Explain PySpark first_value() Function with Examples

The first_value() function in PySpark is a window function that returns the first value of a column within a window partition, based on the specified ordering. Unlike the aggregate first() function, which returns the first element of a column or group, first_value() is used with Window specifications and works row-by-row, making it suitable for advanced analytical queries.


This function is useful in scenarios such as:

  • Fetching the first value in each partition or group.
  • Returning the earliest record in an ordered dataset.
  • Handling null values with the ignoreNulls parameter.
  • Summarizing datasets where only the first meaningful value is required.

In this article, we’ll explore:

  • What is the PySpark first_value() function
  • Syntax and parameters
  • Return value
  • Usage with and without partitions
  • Handling nulls using ignoreNulls
  • Frequently Asked Questions (FAQs)
  • Key points

Key Points

  • first_value() is a window function, unlike first() which is an aggregate function.
  • Available in the pyspark.sql.functions module.
  • Requires a Window specification to operate.
  • By default, nulls are considered, so if the first value in the ordered window is null, null is returned.
  • The ignoreNulls=True parameter skips nulls and returns the first non-null value.
  • Works both with and without partitions (via .partitionBy()).
  • Ordering must be explicitly defined using .orderBy() for deterministic results.
  • Returns null if no non-null values exist in the partition.
  • Useful in analytics to fetch the first meaningful record from partitions or datasets.
  • Runs as a distributed window operation, so it scales to large datasets when a partitioning column is defined.

PySpark first_value() Function

The first_value() function belongs to the pyspark.sql.functions module (available since PySpark 3.5.0). It returns the first value of a column in an ordered group of rows. It is often used along with Window.partitionBy() and Window.orderBy() to compute results within partitions (groups) or across the entire DataFrame.
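In practice, the call follows a simple pattern, shown in the minimal sketch below (df, groupCol, orderCol, and colName are placeholder names, not part of any API):


# General usage pattern (placeholder names)
from pyspark.sql.functions import first_value
from pyspark.sql.window import Window

windowSpec = Window.partitionBy("groupCol").orderBy("orderCol")
df = df.withColumn("first_val", first_value("colName").over(windowSpec))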

Syntax of PySpark first_value()

The syntax of the first_value() function is shown below.


# Syntax of first_value()
pyspark.sql.functions.first_value(col, ignoreNulls=None)

Parameters

  • col: The column or expression on which the function operates.
  • ignoreNulls (optional): Boolean flag that controls null handling.
    • False or None (default): Nulls are considered; if the first value in the window is null, null is returned.
    • True: Nulls are ignored and the first non-null value is returned.

Return Value

The function returns the first value in the window partition.

  • The result has the same data type as the input column.
  • If all values in the partition are null, the result is null (even if ignoreNulls=True).
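To illustrate the all-null case, here is a minimal sketch (with hypothetical data) where one group contains only nulls:


# Sketch: a partition with only nulls returns null even with ignoreNulls=True
from pyspark.sql import SparkSession
from pyspark.sql.functions import first_value
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("AllNullPartition").getOrCreate()

# Hypothetical data: group "A" has only null values
data = [("A", None), ("A", None), ("B", 10)]
df = spark.createDataFrame(data, ["group", "value"])

windowSpec = Window.partitionBy("group").orderBy("value")
df.withColumn(
    "first_non_null", first_value("value", ignoreNulls=True).over(windowSpec)
).show()
# Rows in group "A" get null; rows in group "B" get 10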

Use first_value() With Partitioning

We can select the first value from each group using the PySpark DataFrame API. In this section, we will see how to use the window function first_value() with partitionBy(). Let’s create a DataFrame and define a window specification with partitionBy() and orderBy() to get the first value of each group based on the specified ordering.


# Imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import first_value
from pyspark.sql.window import Window

# Initialize Spark session
spark = SparkSession.builder.appName("FirstValueExample").getOrCreate()

# Sample data
data = [
    ("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", None),
    ("Jen", "Finance", 3000),
    ("Jeff", "Marketing", 3000)
]

columns = ["employee_name", "department", "salary"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

# Display the DataFrame
df.show()


# Define Window
windowSpec = Window.partitionBy("department").orderBy("salary")

# Apply first_value
df.withColumn("first_salary", first_value("salary").over(windowSpec)).show()

Yields below output. Note that Spark places nulls first when sorting in ascending order, so the Finance partition's first value is null (Maria's null salary sorts ahead of Jen's 3000).


# Output (department display order may vary):
+-------------+----------+------+------------+
|employee_name|department|salary|first_salary|
+-------------+----------+------+------------+
|        Maria|   Finance|  null|        null|
|          Jen|   Finance|  3000|        null|
|         Jeff| Marketing|  3000|        3000|
|        James|     Sales|  3000|        3000|
|       Robert|     Sales|  4100|        3000|
|      Michael|     Sales|  4600|        3000|
+-------------+----------+------+------------+

A step-by-step breakdown of the code:

  • Partition the DataFrame on the department column, which groups rows with the same department together.
  • Apply orderBy() on the salary column inside each partition.
  • Add a new column by running first_value("salary") over the window.
  • For each department group, it selects the first salary based on the ordering.
  • The same first salary is assigned to all rows in that partition.

Use first_value() with ignoreNulls

We can select the first non-null value from each group by using first_value() with the ignoreNulls parameter.


# Use first_value with ignoreNulls
df.withColumn(
    "first_salary_non_null",
    first_value("salary", ignoreNulls=True).over(windowSpec)
).show()

Yields below output. With ignoreNulls=True, the Finance partition now returns 3000 instead of null.


# Output (department display order may vary):
+-------------+----------+------+---------------------+
|employee_name|department|salary|first_salary_non_null|
+-------------+----------+------+---------------------+
|        Maria|   Finance|  null|                 3000|
|          Jen|   Finance|  3000|                 3000|
|         Jeff| Marketing|  3000|                 3000|
|        James|     Sales|  3000|                 3000|
|       Robert|     Sales|  4100|                 3000|
|      Michael|     Sales|  4600|                 3000|
+-------------+----------+------+---------------------+

A step-by-step breakdown of the code:

  • Partition the DataFrame by the department column.
  • Apply orderBy() on the salary column.
  • Normally, if the first row in the order has a null salary, then the result is null for all rows in that partition.
  • By enabling ignoreNulls=True, the function skips null values and picks the next available non-null salary.
  • Each department gets the first non-null salary instead of returning nulls.

Use first_value() Without Partitioning

We can also use first_value() without partitioning; in this case, it returns the first value from the entire DataFrame after ordering. Keep in mind that without partitionBy(), Spark moves all rows into a single partition to evaluate the window (and logs a warning), which can be expensive on large datasets.


# Define Window WITHOUT partitioning
windowSpec = Window.orderBy("salary")

# Apply first_value
df.withColumn("first_salary_global", first_value("salary").over(windowSpec)).show()

Yields below output. Because ascending order places nulls first, Maria's null salary is the global first value, so first_salary_global is null for every row. To get the first non-null salary (3000) instead, pass ignoreNulls=True or order with asc_nulls_last.


# Output:
+-------------+----------+------+-------------------+
|employee_name|department|salary|first_salary_global|
+-------------+----------+------+-------------------+
|        Maria|   Finance|  null|               null|
|        James|     Sales|  3000|               null|
|          Jen|   Finance|  3000|               null|
|         Jeff| Marketing|  3000|               null|
|       Robert|     Sales|  4100|               null|
|      Michael|     Sales|  4600|               null|
+-------------+----------+------+-------------------+

A step-by-step breakdown of the code:

  • No partitioning is applied; the entire DataFrame is treated as one group.
  • Apply orderBy() on the salary column across the DataFrame.
  • The first value in the global order is selected; here that is Maria's null salary, since nulls sort first in ascending order.
  • That value is then assigned to all rows in the DataFrame.

Difference Between first() and first_value()

The following table shows the difference between first() and first_value().

| Feature | first() | first_value() |
| --- | --- | --- |
| Type | Aggregate function | Window function |
| Module | pyspark.sql.functions | pyspark.sql.functions |
| Usage | Used with groupBy() or as an aggregate to fetch the first element of a column or group | Used with Window specifications to get the first value within an ordered partition |
| Ordering | Does not guarantee order unless combined with .orderBy() | Requires .orderBy() for deterministic results |
| Null Handling | Supports the ignorenulls parameter | Supports the ignoreNulls parameter |
| Scope | Operates on entire DataFrame or groups | Operates row-by-row within partitions or globally with Window |
| Return Type | Same data type as input column | Same data type as input column |
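To make the table concrete, the sketch below contrasts the two calls, reusing the df created earlier. Note the differing keyword spellings: first() takes ignorenulls (lowercase), while first_value() takes ignoreNulls.


# Sketch: first() vs first_value(), reusing df from the examples above
from pyspark.sql.functions import first, first_value
from pyspark.sql.window import Window

# first(): aggregate function, one row per group
df.groupBy("department") \
  .agg(first("salary", ignorenulls=True).alias("first_salary")) \
  .show()

# first_value(): window function, value repeated on every row of the group
windowSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn(
    "first_salary", first_value("salary", ignoreNulls=True).over(windowSpec)
).show()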

Frequently Asked Questions on the PySpark first_value() Function

What is the difference between first() and first_value() in PySpark?

first() is an aggregate function that returns the first element of a group.
first_value() is a window function that returns the first value in a window or partition after applying ordering.

How does first_value() handle null values?

By default, if the first row contains a null, first_value() will return null. You can set ignoreNulls=True to skip nulls and return the first available non-null value.

How can I use first_value() without partitioning?

You can define a window with only orderBy() (without partitionBy()), and first_value() will return the first value across the entire dataset.

What is the return type of first_value()?

The return type is the same as the input column’s data type. For example, if the column is of type Integer, the result will also be Integer.

When should I use first_value() in PySpark?

You should use first_value() when you need the first ordered value within a window or partition (e.g., first salary in each department, first transaction per customer).
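For instance, here is a minimal sketch (with hypothetical data) that fetches each customer's first transaction amount by date:


# Sketch: first transaction amount per customer (hypothetical data)
from pyspark.sql import SparkSession
from pyspark.sql.functions import first_value
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("FirstTxnExample").getOrCreate()

txns = spark.createDataFrame(
    [
        ("c1", "2024-01-05", 120.0),
        ("c1", "2024-02-10", 80.0),
        ("c2", "2024-01-20", 50.0),
    ],
    ["customer", "txn_date", "amount"],
)

windowSpec = Window.partitionBy("customer").orderBy("txn_date")
txns.withColumn("first_amount", first_value("amount").over(windowSpec)).show()
# c1 rows get 120.0 (earliest date); c2 rows get 50.0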

Conclusion

In this article, we explored the PySpark first_value() function, its syntax, parameters, return type, and how it differs from the aggregate first() function. We also discussed how to use it with partitions, handle nulls with ignorenulls, and common FAQs.

By combining first_value() with Window specifications and ordering, you can retrieve the first meaningful value per partition or across datasets — a powerful tool for analytics and reporting in PySpark pipelines.

Happy Learning!!
