Is there a way to create a loop where I provide a function and dataframe and subsample it, and repeat the function with a subsample N times?

Question

I am not sure what the correct word for this would be, so apologies for getting the terminology horribly wrong. Basically I have about 1000 datapoints, and I want to randomly subsample 100 data points 999 times and perform the same function (a generalised least squares model) on each subsample, and see how often the correlation would be significant.

I am also adding some more context, in case it helps. My data is in a data frame with various columns, and I am testing whether there is a relationship between altitude and dichromatism, and whether that relationship varies depending on whether dichromatism is measured using a spectrophotometer or human scoring. I also include the latitude centroid of each species' range in these models, so the PGLS for each looks as follows:

library(nlme)  # gls()
library(ape)   # corPagel()

PGLS_VO_Score <- gls(Colour_discriminability_Absolute ~ Altitude_Reported*Centroid.Abs, 
                     correlation = corPagel(1, phy = AvianTreeEdge, form = ~Species), 
                     data = VO_HumanScores_Merged, method = "ML")

PGLS_Human_Score <- gls(Human_Score ~ Altitude_Reported*Centroid.Abs, 
                        correlation = corPagel(1, phy = AvianTreeEdge, form = ~Species), 
                        data = VO_HumanScores_Merged, method = "ML")

The data frame VO_HumanScores_Merged includes a column for species names, and columns for human scores, spectrophotometer scores, altitude, and latitude, plus some transformed versions of those (log-transformed, etc.), which I created at the outset in case I needed to transform the data to meet the assumptions of the PGLS.

r2evans · Accepted Answer · 2023-09-09 17:32:27Z

1

Doing the sampling as the first step of a pipeline helps show what can be done here:

myfun <- function(x) cor(x[[1]], x[[3]])
set.seed(42)
replicate(5, mtcars[sample(nrow(mtcars), 10),], simplify=FALSE) |>
  lapply(myfun)
# [[1]]
# [1] -0.8130999
# [[2]]
# [1] -0.8633841
# [[3]]
# [1] -0.7967049
# [[4]]
# [1] -0.901294
# [[5]]
# [1] -0.8761853

(My 5 is your 999, my 10 is your 100.)

The simplify=FALSE is required, since otherwise replicate will reduce the result to a (nested) matrix, which is not what we want. My myfun is contrived; use whatever function you want.
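To see the difference concretely, here is a small sketch on the built-in mtcars data that compares the two return shapes. The names simplified and kept are only illustrative.

```r
set.seed(42)

# Default (simplify = TRUE): the data frames are collapsed into a list-matrix,
# one row per mtcars column and one column per replicate.
simplified <- replicate(3, mtcars[sample(nrow(mtcars), 10), ])
is.matrix(simplified)
dim(simplified)

# simplify = FALSE: a plain list holding one data frame per replicate.
kept <- replicate(3, mtcars[sample(nrow(mtcars), 10), ], simplify = FALSE)
class(kept)
sapply(kept, nrow)
```

Only the second form can be passed straight to lapply() with a function that expects a data frame.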

The (perhaps only) advantage to breaking it out into two (or more) steps in a pipeline is that if you want to go back to revisit the random sampling, it's much simpler if you save that random sampling. For example,

set.seed(42)
sampdat <- replicate(5, mtcars[sample(nrow(mtcars), 10),], simplify=FALSE)
lapply(sampdat, myfun)
# [[1]]
# [1] -0.8130999
# [[2]]
# [1] -0.8633841
# [[3]]
# [1] -0.7967049
# [[4]]
# [1] -0.901294
# [[5]]
# [1] -0.8761853

If you later realize you need to do something else with the sample data (another metric or whatever) and you don't (for time, memory, or convenience) want to have to rerun all of the other sample-aggregations, you can re-use sampdat.
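As a rough illustration of that reuse, the sketch below computes a second, made-up metric (otherfun is hypothetical, not something from the question) on the very same saved subsamples:

```r
myfun    <- function(x) cor(x[[1]], x[[3]])   # mpg vs disp, as above
otherfun <- function(x) median(x[["wt"]])     # hypothetical second metric

set.seed(42)
sampdat <- replicate(5, mtcars[sample(nrow(mtcars), 10), ], simplify = FALSE)

# Both metrics are computed on the same five subsamples, so rows line up.
results <- data.frame(
  cor_mpg_disp = sapply(sampdat, myfun),
  median_wt    = sapply(sampdat, otherfun)
)
results
```

Because sampdat is stored, adding otherfun later does not require drawing the samples again.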



6 Comments

PowellHall Over a year ago

Thank you for your comment. I think I did something wrong, and am not sure why, because every output I got was the exact same, which I do not believe is what is meant to happen. This is what I put in

myfun <- function(PGLS_VO_Scores) cor(VO_HumanScores_Merged$Colour_discriminability_Absolute,
                                      VO_HumanScores_Merged$Altitude_Reported)
BirdReplicationAttempt <- replicate(999, VO_HumanScores_Merged[sample(nrow(VO_HumanScores_Merged), 100),], simplify=FALSE) |>
  lapply(myfun)

PowellHall Over a year ago

I have added more context to the original query in case that helps in understanding where the error occurred

r2evans Over a year ago

You write a function that accepts as its sole argument PGLS_VO_Scores but never use it, instead choosing to breach scope and grab data from something else entirely. The function is supposed to take sample data and do something with that sample data, not data that might (or might not) be in some calling environment.

r2evans Over a year ago

Try changing your function to myfun <- function(x) cor(x$Colour_discriminability_Absolute, x$Altitude_Reported) and rerun your replication.
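A minimal sketch of the difference, using mtcars in place of the real data (broken and fixed are illustrative names):

```r
set.seed(42)
samples <- replicate(5, mtcars[sample(nrow(mtcars), 10), ], simplify = FALSE)

# Ignores its argument and always reads the full mtcars from the global scope.
broken <- function(x) cor(mtcars$mpg, mtcars$wt)
# Uses only the subsample it is handed.
fixed  <- function(x) cor(x$mpg, x$wt)

broken_out <- sapply(samples, broken)
fixed_out  <- sapply(samples, fixed)

length(unique(broken_out))  # 1: every replicate returns the full-data value
length(unique(fixed_out))   # more than 1: the value changes with the subsample
```

Identical output across all replicates is the telltale sign that the function is not using the data passed to it.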

PowellHall Over a year ago

Thanks, that seems to have worked. And just to confirm: is the output the p-values of the correlation, or the correlation itself?
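For reference, cor() returns the correlation coefficient itself, not a p-value. A sketch of one way to get p-values instead, assuming a plain Pearson test via cor.test() is acceptable (mtcars and its columns stand in for the real data):

```r
set.seed(42)
samples <- replicate(999, mtcars[sample(nrow(mtcars), 10), ], simplify = FALSE)

# cor() gives only the coefficient; cor.test() also carries a p-value.
pvals <- sapply(samples, function(x) cor.test(x$mpg, x$wt)$p.value)

# Share of subsamples in which the correlation is significant at 0.05.
prop_sig <- mean(pvals < 0.05)
prop_sig
```

The 0.05 threshold is only an example; prop_sig is the "how often is it significant" figure asked about in the question.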


Mitchell Olislagers · Accepted Answer · 2023-09-09 17:28:49Z

0

You can take a random sample from your datapoints using sample. Then you can run your function n times using replicate. An example that takes a random sample of n=100 and computes the mean 10 times:

> set.seed(1)
> datapoints <- runif(1000, max = 10000)
> result <- replicate(10, mean(sample(datapoints, 100)))
> result
5194.298 5063.320 5064.992 4681.281 5008.011 4849.998 5320.206 5012.931 4900.636 4776.135


1 Comment

PowellHall Over a year ago

Thank you for your comment. I tried to do this with the PGLS that I want to rerun, substituting it for "mean" and my data set for "datapoints", so it reads as follows: replicate(999, PGLS_VO_Score(sample(VO_HumanScores_Merged, 100))). However, I only got an error: Error in PGLS_VO_Score(sample(VO_HumanScores_Merged, 100) : could not find function "PGLS_VO_Score". Is there a way to resolve this so that it recognises the model I fitted to the entire dataset as the function I want to apply to each subset?
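Two things seem to be going on in that attempt: PGLS_VO_Score holds a fitted model object rather than a function, and sample() applied to a data frame draws columns, not rows. A rough sketch of one way around both, with lm() on mtcars standing in for the gls() call (fit_one and subsample_rows are hypothetical helper names, and whether the phylogeny needs adjusting for each subsample is not covered here):

```r
# Hypothetical helper: fits the model to whatever data frame it is given
# and returns the p-value of one term. lm() is a stand-in for gls().
fit_one <- function(d) {
  fit <- lm(mpg ~ wt * hp, data = d)
  summary(fit)$coefficients["wt", "Pr(>|t|)"]
}

# Sample row indices; sample(df, n) would pick columns instead.
subsample_rows <- function(df, n) df[sample(nrow(df), n), , drop = FALSE]

set.seed(42)
pvals <- replicate(999, fit_one(subsample_rows(mtcars, 20)))
mean(pvals < 0.05)   # share of subsamples where the wt term is significant
```

For the real analysis, the body of fit_one would contain the gls() call from the question with data = d, and the extracted value would come from that model's summary.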
