Pyspark dataframe filter using occurrence based on column

Question

I have a PySpark DataFrame with columns A and B. I want to keep only the values of B for rows whose A value occurs more than some number N times.

Column A is like an ID and can contain repeated values. Right now I group by A, filter on the count, and then filter the DataFrame against the resulting list of values, which is not efficient, so I am looking for a more efficient solution.

Example

N = 5

Input Image

Expected Output Image

As you can see, only ID1 and ID3 from column A are selected because of the threshold of 5; all other IDs are excluded.

You may want to look at this question: stackoverflow.com/questions/45395093/…
– skibee, Commented May 6, 2020 at 7:03

user · Accepted Answer · 2018-08-27 05:30:11Z

1

Try the following:

df = ... # The dataframe
N = 5 # The value to test
df_b = df.filter(df['A'] >= N).select('B')

This first filters the DataFrame down to the rows where the value of A is >= N, keeping their corresponding B values. It then selects only column B to obtain the final result.

answered Aug 27, 2018 at 5:30


5 Comments

Aditya Thakkar Over a year ago

I think this is not what I am looking for. I think you missed the point about occurrences. Please read the question carefully. If it were as simple as that, I wouldn't have posted the question :)

user Over a year ago

Then please give an example of the input and expected output. What is the correct understanding of "repeated values"? Is it a nested structure like a list? Concatenated values?

Aditya Thakkar Over a year ago

Repeated means duplicate values.

user Over a year ago

So what you actually want is the key (A) values with their corresponding B values, like ID1: [B1, B6, B7...]?

Aditya Thakkar Over a year ago

Yes, you can find out more by looking at the input and output.
