
I am not sure what the correct term for this is, so apologies if I get the terminology wrong. I have about 1000 data points, and I want to randomly subsample 100 of them 999 times, fit the same model (a generalised least squares model) to each subsample, and see how often the correlation is significant.

I am also adding some more context, in case it helps. My data is in a data frame with various columns, and I am testing whether there is a relationship between altitude and dichromatism, and whether that relationship varies depending on whether dichromatism is measured with a spectrophotometer or by human scoring. I also include the latitude centroid of each species' range in these models, so the PGLS for each looks as follows:

PGLS_VO_Score <- gls(Colour_discriminability_Absolute ~ Altitude_Reported*Centroid.Abs, 
                          correlation = corPagel(1, phy = AvianTreeEdge, form = ~Species), 
                          data = VO_HumanScores_Merged, method = "ML")

PGLS_Human_Score <- gls(Human_Score ~ Altitude_Reported*Centroid.Abs, 
                        correlation = corPagel(1, phy = AvianTreeEdge, form = ~Species), 
                        data = VO_HumanScores_Merged, method = "ML")

The data frame VO_HumanScores_Merged includes columns for species names, human scores, spectrophotometer scores, altitude, and latitude, plus some transformed versions of those (log-transformed, etc.), which I created at the outset in case I needed to transform the data to meet the assumptions of the PGLS.

2 Answers

Doing the sampling in a pipeline helps to show what can be done here:

myfun <- function(x) cor(x[[1]], x[[3]])
set.seed(42)
replicate(5, mtcars[sample(nrow(mtcars), 10),], simplify=FALSE) |>
  lapply(myfun)
# [[1]]
# [1] -0.8130999
# [[2]]
# [1] -0.8633841
# [[3]]
# [1] -0.7967049
# [[4]]
# [1] -0.901294
# [[5]]
# [1] -0.8761853

(My 5 is your 999, my 10 is your 100.)

simplify=FALSE is required because otherwise replicate will reduce the result to a (nested) matrix, which is not what we want. My myfun is contrived; use whatever function you want.

The (perhaps only) advantage to breaking it out into two (or more) steps in a pipeline is that if you want to go back to revisit the random sampling, it's much simpler if you save that random sampling. For example,

set.seed(42)
sampdat <- replicate(5, mtcars[sample(nrow(mtcars), 10),], simplify=FALSE)
lapply(sampdat, myfun)
# [[1]]
# [1] -0.8130999
# [[2]]
# [1] -0.8633841
# [[3]]
# [1] -0.7967049
# [[4]]
# [1] -0.901294
# [[5]]
# [1] -0.8761853

If you later realize you need to do something else with the sample data (another metric or whatever) and you don't (for time, memory, or convenience) want to have to rerun all of the other sample-aggregations, you can re-use sampdat.
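
As a minimal sketch of that re-use (the second metric, a correlation between mpg and wt, is only an illustrative choice, and the names cor_mpg_disp and cor_mpg_wt are made up for this example), the saved sampdat list can be passed to another function without drawing new samples:

```r
# Draw the subsamples once and keep them (same call as above).
set.seed(42)
sampdat <- replicate(5, mtcars[sample(nrow(mtcars), 10), ], simplify = FALSE)

# First metric: correlation between columns 1 (mpg) and 3 (disp),
# i.e. what myfun computes, collected into a numeric vector by sapply.
cor_mpg_disp <- sapply(sampdat, function(x) cor(x$mpg, x$disp))

# Second metric on the *same* subsamples: correlation between mpg and wt.
cor_mpg_wt <- sapply(sampdat, function(x) cor(x$mpg, x$wt))

# One row per subsample, one column per metric.
data.frame(cor_mpg_disp, cor_mpg_wt)
```

Using sapply instead of lapply is a convenience here: each call returns a single number, so the result collapses to a plain numeric vector.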


6 Comments

Thank you for your comment. I think I did something wrong, and am not sure why, because every output I got was exactly the same, which I do not believe is what is meant to happen. This is what I put in:
myfun <- function(PGLS_VO_Scores) cor(VO_HumanScores_Merged$Colour_discriminability_Absolute, VO_HumanScores_Merged$Altitude_Reported)
BirdReplicationAttempt <- replicate(999, VO_HumanScores_Merged[sample(nrow(VO_HumanScores_Merged), 100),], simplify=FALSE) |> lapply(myfun)
I have added more context to the original query in case that helps in understanding where the error occurred
You write a function that accepts as its sole argument PGLS_VO_Scores but never use it, instead choosing to breach scope and grab data from something else entirely. The function is supposed to take sample data and do something with that sample data, not data that might (or might not) be in some calling environment.
Try changing your function to myfun <- function(x) cor(x$Colour_discriminability_Absolute, x$Altitude_Reported) and rerun your replication.
Thanks, that seems to have worked. Just to confirm: is the output the p-values of the correlation, or the correlation itself?
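
For reference, cor() returns only the correlation coefficient; a p-value comes from cor.test(). A minimal sketch, using mtcars columns as stand-ins for the real data (myfun_p is a hypothetical name, and the 0.05 threshold is an assumption):

```r
# cor() gives the coefficient only; cor.test() also carries a p-value.
myfun_p <- function(x) {
  test <- cor.test(x$mpg, x$disp)        # Pearson test by default
  c(estimate = unname(test$estimate),    # the correlation itself
    p.value  = test$p.value)             # its p-value
}

set.seed(42)
res <- replicate(5, mtcars[sample(nrow(mtcars), 10), ], simplify = FALSE) |>
  sapply(myfun_p)

res                            # 2 x 5 matrix: rows "estimate" and "p.value"
mean(res["p.value", ] < 0.05)  # share of subsamples significant at 0.05
```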

You can take a random sample from your datapoints using sample. Then you can run your function n times using replicate. An example that takes a random sample of n=100 and computes the mean 10 times:

> set.seed(1)
> datapoints <- runif(1000, max = 10000)
> result <- replicate(10, mean(sample(datapoints, 100)))
> result
5194.298 5063.320 5064.992 4681.281 5008.011 4849.998 5320.206 5012.931 4900.636 4776.135

1 Comment

Thank you for your comment. I tried to do this using the PGLS function which I want to rerun, substituting it for "mean" and substituting my data set for "datapoints", so it reads as follows:
replicate(999, PGLS_VO_Score(sample(VO_HumanScores_Merged, 100)))
but I only got an error:
Error in PGLS_VO_Score(sample(VO_HumanScores_Merged, 100) : could not find function "PGLS_VO_Score"
Is there a way to resolve this so that it recognises the function which I used for the entire dataset as the function I want to apply to each subset?
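
The error arises because PGLS_VO_Score is a fitted model object, not a function. Separately, sample() applied to a data frame picks columns rather than rows. A rough sketch of one way around both, using lm() on mtcars as a stand-in for the gls() call so that the example runs with base R only (fit_p, the formula, and the sample sizes are hypothetical):

```r
# Wrap the model fit in a function that takes a data frame (one subsample)
# and returns the p-value of the term of interest.
fit_p <- function(d) {
  fit <- lm(mpg ~ wt * qsec, data = d)          # stand-in for the gls() call
  summary(fit)$coefficients["wt", "Pr(>|t|)"]   # p-value for the wt term
}

set.seed(1)
p_values <- replicate(
  20,                                           # 999 in the real analysis
  fit_p(mtcars[sample(nrow(mtcars), 15), ])     # sample rows, not columns
)

mean(p_values < 0.05)   # proportion of subsamples with a significant term
```

For the real models, the body of fit_p would hold the gls() call with data = d, and the p-values should be available from summary(fit)$tTable. With a phylogenetic correlation structure, the tree may also need to be pruned to the species present in each subsample; that part is an assumption and has not been tested here.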
