I have a PySpark DataFrame with columns A and B. I want to keep only the values of B for which the corresponding value of A occurs more than some number N of times.
Column A is like an ID and can contain repeated values. Right now I do a group-by, filter on the counts, and then filter the DataFrame using the resulting list of values. This is not efficient, so I am looking for a more efficient solution.
Example
N = 5
Input Image
Expected Output Image
You can see that only ID1 and ID3 from column A are selected because of the threshold of 5; all other IDs are excluded.