I have a DataFrame like this (shown as an image in the original post):

Reproduce:

df = spark.createDataFrame([(1, 4, 3), (2, 4, 2), (3, 4, 5), (1, 5, 3), (2, 5, 2), (3, 6, 5)], ['a', 'b', 'c'])

I want to restrict the duplicates in column 'b' to two: at most two rows per value of 'b' will be kept, and the rest will be dropped. After that, I want to add a new column 'd' containing a running count in ascending order (1, 2) within each group, like:

(expected output shown as an image in the original post)

Is there a PySpark equivalent of a pandas rolling window? I have failed to dig one out of Stack Overflow and the documentation. In pandas, I might have done something like:

y1 = y[df.COL3 == 'b']
y1 = y1.rolling(window).apply(lambda x: np.max(x) if len(x) > 0 else 0).fillna('drop')
y = y1.reindex(y.index, fill_value=0).loc[lambda x: x != 'drop']

I am new to PySpark, thanks in advance.

  • That's just a row_number. You filter on your row number < 3. Commented Jan 9, 2023 at 17:19

1 Answer


You can create a Window partitioned by column b, compute row_number over that window, and keep only row numbers less than or equal to 2:

# Prepare data:
from pyspark.sql.functions import col, row_number
from pyspark.sql import SparkSession, Window
spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([(1, 4, 3), (2, 4, 2), (3, 4, 5), (1, 5, 3), (2, 5, 2), (3, 6, 5)], ['a', 'b', 'c'])

# Actual work: order by "a" inside each partition so the numbering is
# deterministic (ordering by the partition column "b" itself would leave
# the order of rows within a group arbitrary).
w = Window.partitionBy(col("b")).orderBy(col("a"))
df.withColumn("d", row_number().over(w)).filter(col("d") <= 2).show()

+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  1|  4|  3|  1|
|  2|  4|  2|  2|
|  1|  5|  3|  1|
|  2|  5|  2|  2|
|  3|  6|  5|  1|
+---+---+---+---+

2 Comments

That certainly did the job. Interesting how PySpark handles row_number. Thanks a lot!
@Strayhorn you're welcome. I recommend checking how windows work in Spark; it's not very difficult. A window usually contains a partitionBy to create one partition per group of rows sharing the same key, then an orderBy to sort the rows inside each partition; you can then perform an action over each partition, such as row_number, sum, min, max ...
