I am trying to implement an algorithm for sampling in several stages where only the final size of the sample is known.
Here is an example of the structure of my sampling frame. Where:
- cluster is a block of households.
- total_households is the number of household in each block
- group is a grouping of blocks depending on the number of households in the blocks.
- Probability is the probability of select one group.
Then, the algorithm have the next steps: Given a sample size $n$
- Select one group with unequal probalities with simple random sampling whith replacement.
- Select with simple random sampling without replace one cluster whithin the group selected in the previous step and remove it from the sampling frame.
- in the previously selected cluster, select only 25% of households.
- Repeat until the exact sample size is reached
Because
cluster total_households group Probability
1 173494 13 2 4.055410e-01
2 173495 19 5 4.176953e-02
3 173496 22 5 4.176953e-02
4 173497 21 5 4.176953e-02
5 173498 18 5 4.176953e-02
6 173499 27 7 6.775638e-05
7 173500 15 4 5.020529e-01
8 173501 19 5 4.176953e-02
I want to implement this algorithm with R. I know there is a package for this called sampling with the multistage function, but it does not work. Because, I must specify the number of clusters and groups before implementing the algorithm. My programming skills are limited. I've been trying to do something with a while loop, but I think I'm far from the correct result.
require(dplyr) # to use pipes in the code
n_sample = 844
group = NULL
total = NULL
cluster = NULL
total_households = NULL
total = 0
i = 1
while(total < n_sample){
group[i] = groups[sample(nrow(groups),size = 1,prob = groups$P),c("group")]
total_households = data[data$group==group[i],] %>%
sample_n(size=1) %>%
select(total_households)
cluster[i] = data[data$group==group[i],] %>%
sample_n(size=1) %>%
select(cluster) %>% as.numeric()
data = data[data$cluster!=cluster[i],]
total = total+total_households
i = i+1
}