I have a DataFrame like this (shown as an image in the original post):

Reproduce:

df = spark.createDataFrame([(1, 4, 3), (2, 4, 2), (3, 4, 5), (1, 5, 3), (2, 5, 2), (3, 6, 5)], ['a', 'b', 'c'])

I want to restrict the duplicates in column 'b' to two: at most two rows per value of 'b' will be kept, and the rest will be dropped. After that, I want to add a new column 'd' containing a running count in ascending order (1, 2) within each group, like:

(expected output shown as an image in the original post)

Is there a PySpark equivalent of a pandas rolling window? I have failed to dig one out of Stack Overflow and the documentation. In pandas, I might have done something like:

y1 = y[df.COL3 == 'b']
y1 = y1.rolling(window).apply(lambda x: np.max(x) if len(x) > 0 else 0).fillna('drop')
y = y1.reindex(y.index, fill_value=0).loc[lambda x: x != 'drop']

I am new to PySpark, thanks in advance.

  • That's just a row_number. You filter on your row number < 3. Commented Jan 9, 2023 at 17:19

1 Answer


You can create a Window partitioned by column b, compute row_number over that window, and keep only row numbers less than or equal to 2:

# Prepare data:
from pyspark.sql.functions import col, row_number
from pyspark.sql import SparkSession, Window
spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([(1, 4, 3), (2, 4, 2), (3, 4, 5), (1, 5, 3), (2, 5, 2), (3, 6, 5)], ['a', 'b', 'c'])

# Actual work: order by "a" inside each partition so the numbering is
# deterministic (ordering by the partition column "b" itself would leave
# the order of rows within a group arbitrary).
w = Window.partitionBy(col("b")).orderBy(col("a"))
df.withColumn("d", row_number().over(w)).filter(col("d") <= 2).show()

+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  1|  4|  3|  1|
|  2|  4|  2|  2|
|  1|  5|  3|  1|
|  2|  5|  2|  2|
|  3|  6|  5|  1|
+---+---+---+---+

2 Comments

That certainly did the job. Interesting how PySpark handles row_number. Thanks a lot!
@Strayhorn you're welcome. I recommend checking how windows work in Spark; it's not very difficult. A window usually contains a partitionBy to create one partition per group of rows sharing the same key, then an orderBy to sort the rows inside each partition; you can then perform an action over each partition, such as row_number, sum, min, max ...
