I want to sample the data set below based on IDs and the comm_type they fall into. The same ID can have multiple comm_types. The data set is huge, so I want to do further analysis on a smaller sample of 1 million unique IDs. I see there is a sampleBy(col, fractions, seed=None) method for this, but I need to group the data by comm_type and then sample by ID, and I am struggling to figure out the best way to do it. There are other fields in the dataset as well, but the sampling needs to happen on these two columns.
The fractions for comm_type should match the original data in the DF: E = 0.5, M = 0.4, P = 0.1. The original DF has around 19 M unique IDs; I only need to sample 1 M of them, keeping the comm_type fractions consistent with the original dataset.
I would appreciate any help or direction.
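
One possible direction, as a minimal sketch rather than a verified solution: since the sample should keep the same comm_type mix as the original, every comm_type can be given the same sampling fraction (roughly 1 M / 19 M) in sampleBy. The sketch assumes `df` is a PySpark DataFrame and that the ID column is literally named `ID`; the helper names `equal_fractions` and `sample_ids_by_comm_type` are hypothetical.

```python
def equal_fractions(strata, target_ids, total_ids):
    """Give every stratum the same sampling fraction.

    Sampling each comm_type at the same rate keeps the E/M/P mix of the
    sample close to the mix of the original data.
    """
    frac = min(1.0, target_ids / total_ids)
    return {s: frac for s in strata}


def sample_ids_by_comm_type(df, target_ids=1_000_000, total_ids=19_000_000, seed=42):
    """Stratified sample of (ID, comm_type) pairs, joined back to all columns.

    Assumes `df` is a PySpark DataFrame with columns "ID" and "comm_type"
    (the name "ID" is an assumption, adjust as needed).
    """
    # Work on distinct (ID, comm_type) pairs so repeated rows do not
    # give some IDs a higher chance of being picked.
    pairs = df.select("ID", "comm_type").distinct()

    # Collect the comm_type values present in the data (E, M, P here).
    strata = [r["comm_type"] for r in pairs.select("comm_type").distinct().collect()]

    # Same fraction for every comm_type, e.g. about 0.0526 for 1 M of 19 M.
    fractions = equal_fractions(strata, target_ids, total_ids)

    # sampleBy samples each stratum independently at its fraction.
    sampled_pairs = pairs.sampleBy("comm_type", fractions=fractions, seed=seed)

    # Bring back the other fields for the sampled pairs only.
    return df.join(sampled_pairs, on=["ID", "comm_type"], how="inner")
```

Note that sampleBy draws each row independently, so the resulting sizes are approximate rather than exactly 1 M. Because an ID can have several comm_types, this keeps only the sampled (ID, comm_type) pairs; if all comm_types of a chosen ID should be kept instead, sampling the distinct IDs with `df.select("ID").distinct().sample(fraction=..., seed=...)` and joining back on `ID` alone may be a simpler alternative.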
