How to do stratified sampling on two columns in PySpark Dataframe?

Question

I want to sample my data set based on IDs and the comm_type each ID falls into. The same ID can have multiple comm_types. The data set is huge, so I want to run further analysis on a smaller sample of 1 million unique IDs. I see there is a sampleBy(col, fractions, seed=None) method for this, but I need to group the data by comm_type and then sample by ID, and I am struggling to figure out the best way to do it. There are other fields in the data set as well, but the sampling needs to happen on these two columns.

The fractions for comm_type should match the original DataFrame: E = 0.5, M = 0.4, P = 0.1. The original DataFrame has around 19 million unique IDs; I only need a sample of 1 million, keeping the comm_type fractions consistent with the original data set.

I would appreciate any help or direction.

ibozkurt79 · Accepted Answer · 2022-05-23 19:49:39Z

1

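You can use scikit-learn's train_test_split function, which accepts multiple columns for the strata. Note that it works on in-memory data such as a pandas DataFrame, not on a Spark DataFrame directly.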

# columns_to_stratify is a list of the column names that define the strata
sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None,
    random_state=None, shuffle=True, stratify=df[columns_to_stratify])
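
As a rough illustration of the call above, here is a minimal sketch on a small pandas DataFrame. The data is synthetic, and the `region` column is a hypothetical second stratification column that does not come from the question; only `comm_type` and its 0.5 / 0.4 / 0.1 split do. For a Spark DataFrame this would presumably require converting to pandas first (for example with `toPandas()`), which may not be practical at the size described in the question.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows with comm_type shares E = 0.5, M = 0.4, P = 0.1.
# "region" is a hypothetical second column used only to show two-column strata.
rows = []
for comm_type, count in [("E", 500), ("M", 400), ("P", 100)]:
    for i in range(count):
        rows.append(
            {
                "id": len(rows),
                "comm_type": comm_type,
                "region": "north" if i % 2 == 0 else "south",
            }
        )
pdf = pd.DataFrame(rows)

# Draw 100 rows, stratified on the combination of both columns.
# Every (comm_type, region) combination needs at least two rows for this to work.
sample, _rest = train_test_split(
    pdf,
    train_size=100,
    random_state=42,
    stratify=pdf[["comm_type", "region"]],
)

# Share of each comm_type in the sample, to compare against the full data.
shares = sample["comm_type"].value_counts(normalize=True)
print(shares)
```

Here the strata are rows, not unique IDs. If one ID appears under several comm_types, sampling unique IDs would need an extra step (for example, reducing the data to one row per ID before splitting), which this sketch does not attempt.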

