The first_value() function in PySpark is a window function that returns the first value of a column within a window partition, based on the specified ordering. Unlike the aggregate first() function, which returns the first element of a column or group, first_value() is used with Window specifications and works row-by-row, making it suitable for advanced analytical queries.
This function is useful in scenarios such as:
- Fetching the first value in each partition or group.
- Returning the earliest record in an ordered dataset.
- Handling null values with the ignoreNulls parameter.
- Summarizing datasets where only the first meaningful value is required.
In this article, we’ll explore:
- What the PySpark first_value() function is
- Syntax and parameters
- Return value
- Usage with and without partitions
- Handling nulls using ignoreNulls
- Frequently Asked Questions (FAQs)
- Key points
Key Points
- first_value() is a window function, unlike first(), which is an aggregate function.
- Available in the pyspark.sql.functions module.
- Requires a Window specification to operate.
- By default, nulls are included; if the first ordered value is null, null is returned.
- The ignoreNulls=True parameter skips nulls and returns the first non-null value.
- Works both with and without partitions (via .partitionBy()).
- Ordering must be explicitly defined using .orderBy() for deterministic results.
- Returns null if no non-null values exist in the partition.
- Useful in analytics to fetch the first meaningful record from partitions or datasets.
- Executes efficiently in distributed, large-scale Spark environments.
PySpark first_value() Function
The first_value() function belongs to the pyspark.sql.functions module (available as a Python function since PySpark 3.5; in earlier versions the equivalent SQL function can be invoked through expr()). It returns the first value of a column in an ordered group of rows. It is often used along with Window.partitionBy() and Window.orderBy() to compute results within partitions (groups) or across the entire DataFrame.
Syntax of PySpark first_value()
The following is the syntax of the first_value() function:
# Syntax of first_value()
pyspark.sql.functions.first_value(col, ignoreNulls=None)
Parameters
- col: The column or expression on which the function operates.
- ignoreNulls (optional): Boolean flag that controls null handling.
  - False/None (default): Nulls are considered. If the first value is null, null is returned.
  - True: Nulls are ignored and the first non-null value is returned.
Return Value
The function returns the first value in the window partition.
- The result has the same data type as the input column.
- If all values in the partition are null, the result is null (even with ignoreNulls=True).
Use first_value() With Partitioning
We can select the first value from each group using the PySpark DataFrame API. In this section, we will see how to use the window function first_value() with partitionBy(). Let’s create a DataFrame and define a window specification with partitionBy() and orderBy() to get the first value of each group based on the specified ordering.
# Create DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import first_value
from pyspark.sql.window import Window
# Initialize Spark session
spark = SparkSession.builder.appName("FirstValueExample").getOrCreate()
# Sample data
data = [
("James", "Sales", 3000),
("Michael", "Sales", 4600),
("Robert", "Sales", 4100),
("Maria", "Finance", None),
("Jen", "Finance", 3000),
("Jeff", "Marketing", 3000)
]
columns = ["employee_name", "department", "salary"]
# Create the DataFrame
df = spark.createDataFrame(data, columns)
# Display the DataFrame
df.show()
# Define Window
windowSpec = Window.partitionBy("department").orderBy("salary")
# Apply first_value
df.withColumn("first_salary", first_value("salary").over(windowSpec)).show()
Yields below output. Note that in ascending order nulls sort first, so the Finance partition's first value is null (row order may vary).
# Output:
+-------------+----------+------+------------+
|employee_name|department|salary|first_salary|
+-------------+----------+------+------------+
|        Maria|   Finance|  NULL|        NULL|
|          Jen|   Finance|  3000|        NULL|
|         Jeff| Marketing|  3000|        3000|
|        James|     Sales|  3000|        3000|
|       Robert|     Sales|  4100|        3000|
|      Michael|     Sales|  4600|        3000|
+-------------+----------+------+------------+
A step-by-step breakdown after the code:
- Partition the DataFrame on the department column, which groups rows from the same department together.
- Apply orderBy() on the salary column inside each partition. In ascending order, nulls sort first, so a partition containing a null salary starts with null.
- Add a new column by running first_value("salary") over the window; for each department group, it selects the first salary based on the ordering.
- The same first salary is assigned to all rows in that partition.
Use first_value() With ignoreNulls
We can select the first non-null value from each group by using first_value() with the ignoreNulls parameter.
# Use first_value with ignorenulls
df.withColumn(
"first_salary_non_null",
first_value("salary", True).over(windowSpec)
).show()
Yields below output (row order may vary).
# Output:
+-------------+----------+------+---------------------+
|employee_name|department|salary|first_salary_non_null|
+-------------+----------+------+---------------------+
|        Maria|   Finance|  NULL|                 NULL|
|          Jen|   Finance|  3000|                 3000|
|         Jeff| Marketing|  3000|                 3000|
|        James|     Sales|  3000|                 3000|
|       Robert|     Sales|  4100|                 3000|
|      Michael|     Sales|  4600|                 3000|
+-------------+----------+------+---------------------+
A step-by-step breakdown after the code:
- Partition the DataFrame by the department column.
- Apply orderBy() on the salary column.
- Normally, if the first row in the order has a null salary, the result is null for all rows in that partition.
- With ignoreNulls=True, the function skips null values and picks the next available non-null salary.
- Note, however, that the default window frame for an ordered window only extends up to the current row, so the null row itself still sees no non-null value and returns null; subsequent rows in the partition get the first non-null salary.
Use first_value() Without Partitioning
We can also use first_value() without partitioning; in this case, it returns the first value from the entire DataFrame after ordering.
# Define Window WITHOUT partitioning
windowSpec = Window.orderBy("salary")
# Apply first_value
df.withColumn("first_salary_global", first_value("salary").over(windowSpec)).show()
Yields below output.
# Output (order of equal salaries may vary):
+-------------+----------+------+-------------------+
|employee_name|department|salary|first_salary_global|
+-------------+----------+------+-------------------+
|        Maria|   Finance|  NULL|               NULL|
|        James|     Sales|  3000|               NULL|
|          Jen|   Finance|  3000|               NULL|
|         Jeff| Marketing|  3000|               NULL|
|       Robert|     Sales|  4100|               NULL|
|      Michael|     Sales|  4600|               NULL|
+-------------+----------+------+-------------------+
A step-by-step breakdown after the code:
- No partitioning is applied; the entire DataFrame is treated as one group.
- Apply orderBy() on the salary column across the whole DataFrame. In ascending order, nulls sort first, so the null salary is the first value in the global order.
- That first value (here null) is assigned to all rows in the DataFrame.
- To get the first non-null salary instead, pass ignoreNulls=True or order with col("salary").asc_nulls_last().
Difference Between first() and first_value()
The following table shows the difference between first() and first_value().
| Feature | first() | first_value() |
|---|---|---|
| Type | Aggregate function | Window function |
| Module | pyspark.sql.functions | pyspark.sql.functions |
| Usage | Used with groupBy() or as an aggregate to fetch the first element of a column or group | Used with Window specifications to get the first value within an ordered partition |
| Ordering | Does not guarantee order unless combined with .orderBy() | Requires .orderBy() for deterministic results |
| Null Handling | Supports the ignorenulls parameter | Supports the ignoreNulls parameter |
| Scope | Operates on entire DataFrame or groups | Operates row-by-row within partitions or globally with Window |
| Return Type | Same data type as input column | Same data type as input column |
Frequently Asked Questions about the PySpark first_value() Function
What is the difference between first() and first_value()?
first() is an aggregate function that returns the first element of a group. first_value() is a window function that returns the first value in a window or partition after applying ordering.
How does first_value() handle null values?
By default, if the first row contains a null, first_value() returns null. You can set ignoreNulls=True to skip nulls and return the next available non-null value.
Can first_value() be used without partitioning?
Yes, you can define a window with only orderBy() (without partitionBy()), and first_value() will return the first value across the entire dataset.
What is the return type of first_value()?
The return type is the same as the input column's data type. For example, if the column is of type Integer, the result will also be Integer.
When should I use first_value()?
You should use first_value() when you need the first ordered value within a window or partition (e.g., first salary in each department, first transaction per customer).
Conclusion
In this article, we explored the PySpark first_value() function, its syntax, parameters, return type, and how it differs from the aggregate first() function. We also discussed how to use it with partitions, handle nulls with ignorenulls, and common FAQs.
By combining first_value() with Window specifications and ordering, you can retrieve the first meaningful value per partition or across datasets — a powerful tool for analytics and reporting in PySpark pipelines.
Happy Learning!!