
I have 10 dataframes, df1...df10, each with 2 columns:

df1

id | 2011_result

df2

id | 2012_result

...

df10

id | 2018_result

I want to select a sample of ids whose 2011_result values are less than a threshold:

sample_ids = df1[df1['2011_result'] < threshold].sample(10)['id'].values

After this, I need to select the corresponding result values for that list of ids from all the other dataframes.

Something like this:

df2[df2['id'].isin(sample_ids)]['2012_result']
df3[df3['id'].isin(sample_ids)]['2013_result']
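
Put together, a rough sketch of what I have in mind (other_dfs, years, and results are placeholder names; in practice the lists run up to df10 / 2018):

# Remaining dataframes and the year each one covers
other_dfs = [df2, df3]  # ... up to df10
years = [2012, 2013]    # ... up to 2018

# Sample 10 ids from df1 whose 2011_result is below the threshold
sample_ids = df1[df1['2011_result'] < threshold].sample(10)['id'].values

# Pull the matching result values from every other dataframe
results = {
    f'{year}_result': df[df['id'].isin(sample_ids)][f'{year}_result']
    for df, year in zip(other_dfs, years)
}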

Could you please help out?

1 Answer

First, you can filter with:

import pyspark.sql.functions as F

sample_ids = df1.filter(F.col("2011_result") < threshold)
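
If you also want to randomly sample 10 of those ids, as in the question, one option (a sketch, assuming an exact count of 10 is wanted) is to order by rand() and take the first 10:

# Randomly pick 10 of the filtered ids
sample_ids = sample_ids.orderBy(F.rand()).limit(10)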

Then you can use a left_semi join to keep only the sampled ids in df2, df3, etc.:

df2 = df2.join(sample_ids.select("id"), on="id", how="left_semi")
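
To apply the same join across all the remaining dataframes, a minimal sketch (other_dfs and filtered are placeholder names, assuming the dataframes are collected in a list):

# Apply the same semi join to every remaining dataframe
other_dfs = [df2, df3]  # ... up to df10
filtered = [
    df.join(sample_ids.select("id"), on="id", how="left_semi")
    for df in other_dfs
]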