I have the following Spark DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('').getOrCreate()
df = spark.createDataFrame(
    [(1, "a", "2"), (2, "b", "2"), (3, "c", "2"), (4, "d", "2"),
     (5, "b", "3"), (6, "b", "3"), (7, "c", "2")],
    ["nr", "column2", "quant"])
df.show()
which prints:
+---+-------+-----+
| nr|column2|quant|
+---+-------+-----+
|  1|      a|    2|
|  2|      b|    2|
|  3|      c|    2|
|  4|      d|    2|
|  5|      b|    3|
|  6|      b|    3|
|  7|      c|    2|
+---+-------+-----+
I would like to split the rows into consecutive groups of 3 (windows of size 3) and, within each window, keep only the rows where the quant column has unique values, i.e. the first row for each distinct quant value in that window, as in the following picture:
Here red marks the window, and within each window I keep only the green rows, where quant is unique:
The output that I would like to get is the following:
+---+-------+-----+
| nr|column2|quant|
+---+-------+-----+
|  1|      a|    2|
|  4|      d|    2|
|  5|      b|    3|
|  7|      c|    2|
+---+-------+-----+
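To make the rule concrete, here is a minimal plain-Python sketch (not Spark) of what I mean. It assumes the windows are consecutive blocks of 3 rows in nr order, and that "unique" means keeping the first row for each distinct quant value inside a window. The helper name keep_unique_per_window is made up for illustration only.

```python
def keep_unique_per_window(rows, window_size=3, key_index=2):
    """Keep the first row for each distinct key value within each consecutive window.

    Hypothetical helper: rows are tuples already sorted by nr,
    and key_index points at the quant field.
    """
    kept = []
    for start in range(0, len(rows), window_size):
        seen = set()  # quant values already kept in the current window
        for row in rows[start:start + window_size]:
            if row[key_index] not in seen:
                seen.add(row[key_index])
                kept.append(row)
    return kept


rows = [(1, "a", "2"), (2, "b", "2"), (3, "c", "2"), (4, "d", "2"),
        (5, "b", "3"), (6, "b", "3"), (7, "c", "2")]

result = keep_unique_per_window(rows)
print([r[0] for r in result])  # nr values of the kept rows: [1, 4, 5, 7]
```

This is only the logic I am after; what I do not know is how to express the same thing on a Spark DataFrame.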
I am new to Spark, so I would appreciate any help. Thanks.
