I have a PySpark DataFrame with columns A and B. I want to keep only the values of B whose corresponding A value occurs more than some number N times.

Column A is like an ID and can contain repeated values. Right now I group by A, filter on the count, and then filter the DataFrame against the resulting list of values, which is not efficient, so I am looking for a more efficient solution.

Example

N = 5

Input Image

Expected Output Image

You can see that only ID1 and ID3 from column A are selected because of the threshold of 5; all the rest are excluded.

1 Answer

Try the following:

df = ... # The dataframe
N = 5 # The value to test
df_b = df.filter(df['A'] >= N).select('B')

This first filters the DataFrame so that it contains only rows where A is >= N, along with their corresponding B values. After the filter is applied, only column B is selected to obtain the final result.

5 Comments

I think this is not what I am looking for. I think you missed the point about occurrences. Please read the question carefully. If it were as simple as that, I wouldn't have posted the question :)
Then please give an example of input and expected output. What's the correct understanding of "repeated values"? Is it a nested structure like a list? Concatenated values?
Repeated means duplicate values.
So what you actually want is the key (A) values with their corresponding B values, like ID1: [B1, B6, B7...]?
Yes, you can find more by looking at the input and output.
