I have a data set from a survey that has respondents classified by certain demographic data values. The layout of the data is basically this:

      Gender   Age    Income    Region
1     Male     1      2         West
2     Male     4      2         South
3     Male     4      3         West
4     Female   4      1         Northeast
5     Female   5      2         West
6     Female   3      2         West
7     Male     1      1         South
8     Male     3      3         Northeast
9     Female   2      3         West
10    Female   4      3         Midwest

I used the code below to generate the example above. I am open to comments on how to do this better, of course.

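# Category labels for each demographic variable
regions <- c('Midwest','Northeast','South','West')
incomes <- c('1','2','3','4')
gender <- c('Male','Female')
age <- c('1','2','3','4','5')

# Start with an empty data frame and append one random respondent per iteration
data <- data.frame()
for(i in 1:100){
  # stringsAsFactors belongs here, where the columns are actually created
  z <- data.frame(sample(gender,1),sample(age,1),sample(incomes,1),sample(regions,1),
                  stringsAsFactors = FALSE)
  data <- rbind(data, z)
}
colnames(data) <- c("Gender","Age","Income","Region")
data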

I need to break that data set into subsets that each represent the original set. That is, each subset should have the same proportions of genders, age groups, income groups, and regions as the full data set. I understand that exact representation might be difficult with that many factors and a small number of rows.

There is a second part to my problem. R has many built-in functions whose names make it difficult to describe a problem like this unambiguously. Split, data, factors, values, and subset are words that we might use interchangeably when talking about data in general, but they do not yield the right answer when typed into Google or Stack Overflow. I would like to know if there is more technically precise terminology I should use to describe my problem.

  • Not sure if this helps: df1 <- do.call(expand.grid, list(regions, incomes, gender, age)); df2 <- df1[sample(1:nrow(df1), 100, replace=FALSE),] Commented May 22, 2015 at 17:36
  • It sounds from your description like you are trying to sample from the data (by creating subsets that have the same characteristics as the larger dataset). That's an unusual thing to do and it would help if you explained why you need these subsets (as opposed to say, doing by-group analyses of gender, or region, and so on.) Commented May 22, 2015 at 18:02
  • We have a large group of people to pull from, but did not know they were going to be subset afterwards. If we had, we could have assigned these groups during collection. Each person is being recruited into testing a product. I would need some subsets that would then each be analyzed afterwards. For example, each group would be testing a new soda and we want a general idea of how people across all demographics would respond. The actual data set we are using has people that are representative of the US as a whole as far as race, income, region etc. Commented May 22, 2015 at 18:08
  • So it is, essentially, a sampling problem (wanting to create "representative samples" from the larger dataset). Would the material in this question (stackoverflow.com/questions/16493920/…) be of use for you? Commented May 22, 2015 at 18:52
  • You could do this quite easily using dplyr by doing: sample_dat <- dat %>% group_by(Gender, Age, Income, Region) %>% sample_frac(0.5) which would sample 50% of your observations by the variables in group_by. And the 2nd part of your question, you should simply search "stratified sampling". The type of routine you're looking at is sampling without replacement though you could also sample with replacement. Commented May 23, 2015 at 1:17
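
The stratified-sampling idea from the comments can be sketched in base R for the case where the whole data set has to be divided into several subsets, rather than sampled once. This is only one possible approach, not the method any commenter proposed: with 160 possible demographic cells and 100 rows, most cells hold zero or one respondent, so the sketch sorts the rows by the stratifying columns and deals them out in turn instead of sampling within each cell. The helper name split_representative, the choice of k = 4, and the seed are all assumptions made for illustration.

```r
# Hypothetical helper (not from any package): split df into k subsets whose
# demographic mix roughly matches the full data set.
split_representative <- function(df, strata, k) {
  n <- nrow(df)
  # Shuffle first so that rows within the same stratum end up in random order.
  shuffled <- sample(n)
  # Sort the shuffled rows by the stratifying columns; ties keep shuffled order.
  keys <- unname(as.list(df[shuffled, strata, drop = FALSE]))
  ord <- shuffled[do.call(order, keys)]
  # Deal the sorted rows out like cards: 1, 2, ..., k, 1, 2, ...
  group <- integer(n)
  group[ord] <- rep_len(seq_len(k), n)
  split(df, group)
}

# Example data in the same layout as the question (seed chosen arbitrarily).
set.seed(42)
n <- 100
data <- data.frame(
  Gender = sample(c("Male", "Female"), n, replace = TRUE),
  Age    = sample(as.character(1:5), n, replace = TRUE),
  Income = sample(as.character(1:4), n, replace = TRUE),
  Region = sample(c("Midwest", "Northeast", "South", "West"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

subsets <- split_representative(data, c("Gender", "Age", "Income", "Region"), k = 4)

# Compare the gender mix of each subset with that of the full data set.
gender_share <- function(d) {
  prop.table(table(factor(d$Gender, levels = c("Male", "Female"))))
}
print(sapply(subsets, nrow))
print(gender_share(data))
print(sapply(subsets, gender_share))
```

Because the rows are sorted before being dealt out, the first stratifying column is balanced most tightly and later columns only approximately, so the order of the column names passed in strata is a design choice worth checking against the real data.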
