I have a dataframe df of patients, identified by subject_id, that includes each patient's gender and age.

I would like to draw a random sample of size n from this dataframe, with the following characteristics:

  • 50% male, 50% female
  • Median age of 40 years

Any idea how I could accomplish that using Python? Thank you!

1 Answer

I think what you want is a little bit more complex than what DataFrame.sample provides out of the box. A random sample satisfying each of your conditions could be generated (respectively) like this:

  1. Filter for women only, and randomly sample n/2, then do the same for men, and then pool them
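  2. Filter for under-40s, randomly sample n/2, then do the same for those aged 40 and over, and combine them. (Note that this does not guarantee a median of exactly 40: the median will be the midpoint between the oldest sampled under-40 and the youngest sampled patient aged 40 or over.)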

If you want to combine the two constraints, you might need to sample four times, n/4 rows each: women under 40, men under 40, and so on. But this is the general idea.

Code for sampling would look like:

# n // 2 keeps the sample size an integer (n / 2 is a float in Python 3)
df.loc[df.age < 40, 'subject_id'].sample(n // 2)
df.loc[df.gender == 'F', 'subject_id'].sample(n // 2)
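
To combine both constraints, the four-way sampling could be sketched roughly as follows. This is only a sketch under some assumptions: the columns are named gender (with values 'F' and 'M') and age, n is divisible by 4, and every gender/age cell has at least n/4 rows. The names stratified_sample, age_cut, and the toy demo frame are made up for illustration; ages of exactly 40 are placed in the older group here.

```python
import pandas as pd


def stratified_sample(df, n, age_cut=40, seed=None):
    """Draw n rows, n // 4 from each gender x age-group cell (hypothetical helper)."""
    if n % 4 != 0:
        raise ValueError("n must be divisible by 4 for equal-sized cells")
    per_cell = n // 4
    parts = []
    for gender in ("F", "M"):
        for is_younger in (True, False):
            # Rows of this gender on one side of the age cut
            in_cell = (df["gender"] == gender) & ((df["age"] < age_cut) == is_younger)
            # sample() raises ValueError if the cell has fewer than per_cell rows
            parts.append(df[in_cell].sample(n=per_cell, random_state=seed))
    return pd.concat(parts)


# Toy data, invented purely to exercise the function
demo = pd.DataFrame({
    "subject_id": range(200),
    "gender": ["F", "M"] * 100,
    "age": [20 + (i % 50) for i in range(200)],
})

sample = stratified_sample(demo, n=20, seed=0)
print(sample["gender"].value_counts())
print(sample["age"].median())
```

The result is exactly half female and half male, with half the ages below the cut and half at or above it, so the median lands near 40 but not necessarily on it. If an exact median matters, you would need an extra step, such as redrawing until the median hits the target.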