
I want to sample the data set below based on IDs and the comm_type they fall into. The same ID can have multiple comm_types. The data set is huge, so I want to do further analysis on a smaller sample of 1 million unique IDs. I see there is a sampleBy(col, fractions, seed=None) method for this, but I need to group the data by comm_type and then sample by ID, and I am struggling to figure out the best way to do it. There are other fields in the data set as well, but the sampling needs to happen on these two columns.

The comm_type fractions in the sample should match those in the original DataFrame (E = 0.5, M = 0.4, P = 0.1). The original DataFrame has around 19 million unique IDs; I only need to sample 1 million of them while keeping the comm_type fractions consistent with the original data set.


I would appreciate any help or direction.

1 Answer


You can use scikit-learn's train_test_split function. Its stratify argument accepts multiple columns, so the split can be stratified on more than one field.

sklearn.model_selection.train_test_split(*arrays, test_size=None,
    train_size=None, random_state=None, shuffle=True,
    stratify=df[columns_to_stratify])
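As a rough illustration of the call above, here is a minimal sketch on toy data. It assumes the data fits in memory as a pandas DataFrame (a Spark DataFrame would first have to be converted, for example with toPandas()), and that there is one row per ID; the toy frame, the 10% train_size and the seed are made up for the example and are not taken from the question.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data (assumption: one row per ID).
# The comm_type mix mirrors the question: E = 0.5, M = 0.4, P = 0.1.
df = pd.DataFrame({
    "ID": range(1000),
    "comm_type": ["E"] * 500 + ["M"] * 400 + ["P"] * 100,
})

# Keep 10% of the rows; stratify preserves the comm_type proportions.
# The second return value (the remaining 90%) is not needed here.
sample, _ = train_test_split(
    df,
    train_size=0.1,
    random_state=42,
    stratify=df["comm_type"],
)

# Proportions in the sample should be close to those of the full frame.
proportions = sample["comm_type"].value_counts(normalize=True)
print(proportions)
```

For the sizes in the question, train_size would be roughly 1/19 (1 M out of about 19 M IDs). If an ID can appear under several comm_types, the rows would first need to be reduced to one row per ID before stratifying, otherwise the same ID can be selected under one comm_type and dropped under another.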