
This is the data frame I am using. I am trying to subsample column V2 (position) evenly across its range (min: 1130, max: 4406748) so that each value of column V4 (lineage) has only one representative in the final sample. In other words, the sampled positions should be evenly distributed while the whole sample includes only 1 representative of each lineage.

I have tried sorting and binning the data, but I cannot figure out how to sample evenly from the bins so that only 1 representative of each lineage is present in the resulting data frame.

library(dplyr)

# sort by position
sorted_barcodes <- tb_profiler_barcodes %>% arrange(V2)
# bin the data into N equal-width bins
binned_sorted <- sorted_barcodes %>%
  mutate(bin = cut(V2, breaks = 150, labels = FALSE))

I would appreciate your help.

2 Answers

I don't think you can satisfy both conditions (unique V4 and evenly distributed V2) perfectly unless you take a very small sample. If you relax the "evenly distributed" condition slightly, you can guarantee uniqueness of V4. One way to do that is the strata function from the sampling package, which performs stratified sampling and lets you specify that only 1 row be drawn for each value of V4. The distribution of V2 should (in theory) be roughly even, although with random sampling any individual draw may turn out badly.

tb_profiler_barcodes <- read.table("tbdb.barcode.bed", sep="\t")

library(sampling)
library(dplyr)

sorted_barcodes <- tb_profiler_barcodes %>% arrange(V2)

size <- length(unique(tb_profiler_barcodes$V4)) # number of strata
n <- nrow(tb_profiler_barcodes)

set.seed(123) # Omit in practice
s <- strata(sorted_barcodes, 
            stratanames="V4",
            size=rep(1, size),  # select only 1 from each stratum
            method="srswor")

sample <- getdata(sorted_barcodes, s)

Check the uniqueness of V4:

any(duplicated(sample$V4))  # [1] FALSE

Check the distribution of V2:

plot(sorted_barcodes$V2, rep(1, n), pch=19, 
     xlab="", ylab="", yaxt="n", xaxt="n", col="red")
points(sample$V2, rep(0.8, size), pch=20, col="blue")
legend("topleft", legend=c("Data", "Sample"), 
       col=c("red", "blue"),
       pch=c(19,20), bty="n")

If the distribution doesn't look even enough to you, then repeat the sampling (without the seed) until you get one that looks better.
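The retry step can also be automated instead of judged by eye. The following is only a minimal base-R sketch, assuming that "evenness" is scored by the standard deviation of the gaps between consecutive sorted positions (smaller means more even). The toy data frame `toy`, the helper names `draw_one_per_lineage` and `gap_sd`, and the choice of 200 repetitions are all illustrative assumptions and are not part of the sampling package.

```r
# Toy stand-in for the barcode table (assumed data): V2 = position, V4 = lineage
set.seed(1)
toy <- data.frame(
  V2 = sort(sample(1130:4406748, 600)),
  V4 = sample(paste0("lineage", 1:30), 600, replace = TRUE)
)

# Hypothetical helper: draw exactly one random row per lineage
draw_one_per_lineage <- function(df) {
  rows <- split(seq_len(nrow(df)), df$V4)
  picked <- vapply(rows, function(i) i[sample.int(length(i), 1)], integer(1))
  df[picked, ]
}

# Hypothetical evenness score: spread of the gaps between sorted positions
gap_sd <- function(pos) sd(diff(sort(pos)))

# Repeat the stratified draw many times and keep the most even one
draws  <- replicate(200, draw_one_per_lineage(toy), simplify = FALSE)
scores <- vapply(draws, function(s) gap_sd(s$V2), numeric(1))
best   <- draws[[which.min(scores)]]
```

Other scores would work just as well here, for example the maximum gap or a Kolmogorov-Smirnov statistic against a uniform distribution; the gap standard deviation is simply one easy choice.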

This could be approached as an assignment problem, with the cost equal to the distance from an "ideal" distribution of V2 (position) values.

First get the ideal spacing.

r <- range(tb_profiler_barcodes$V2)
n <- length(unique(tb_profiler_barcodes$V4))
ideal <- seq(0.5, n - 0.5)*diff(r)/n + r[1] # ideal "even" spacing
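As a small worked example of that formula, assume a hypothetical range of 0 to 100 and 4 lineages (the names `r_toy`, `n_toy` and `ideal_toy` are made up for illustration). The range is cut into 4 equal segments and the ideal locations are the segment midpoints:

```r
# Hypothetical small case: positions span 0..100 and there are 4 lineages
r_toy <- c(0, 100)
n_toy <- 4

# Midpoints of n_toy equal-width segments covering the range
ideal_toy <- seq(0.5, n_toy - 0.5) * diff(r_toy) / n_toy + r_toy[1]
ideal_toy
# [1] 12.5 37.5 62.5 87.5
```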

Get the distance between each value in V2 and the ideal sample locations.

d <- outer(tb_profiler_barcodes$V2, ideal, \(x, y) abs(x - y))

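For each value in V4 (lineage), get the best candidate for each ideal location (my go-to is data.table for group operations). The result has one row per lineage and one column per ideal location, and each entry is the row number of that lineage's position closest to that location (the column minimum of d within the lineage).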

library(data.table)

idx <- as.matrix(
  cbind(data.table(lineage = tb_profiler_barcodes$V4)[,ID := .I], d)[
          ,lapply(.SD, \(x) ID[which.min(x)]), lineage, .SDcols = 3:(n + 2)
        ][,lineage := NULL]
)

Get the distance for each row index.

mindists <- idx
mindists[] <- d[cbind(c(idx), c(col(idx)))]

Solve the assignment problem and take the samples.

samples <- tb_profiler_barcodes[
  idx[RcppHungarian::HungarianSolver(mindists)$pairs],
]

You can check the distribution with, e.g., plot(sort(samples$V2)), which will show the points are nearly linear.
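To put numbers on that visual check, something along the following lines could be used. This is a rough sketch, not part of the answer's method: the helper `evenness_report` and the toy inputs are hypothetical, and it assumes the same midpoint definition of "ideal" spacing as above.

```r
# Hypothetical helper: summarises how close a sample is to even spacing
evenness_report <- function(pos, lineage, full_range) {
  n <- length(pos)
  # Same "ideal" midpoints as in the spacing step, for n sampled points
  ideal <- seq(0.5, n - 0.5) * diff(full_range) / n + full_range[1]
  list(
    unique_lineages = !anyDuplicated(lineage),          # TRUE if no lineage repeats
    max_abs_dev     = max(abs(sort(pos) - ideal)),      # worst distance from ideal
    linear_r2       = cor(sort(pos), seq_len(n))^2      # 1 means perfectly linear
  )
}

# Toy input (assumed values): four positions on a 0..100 range
rep_toy <- evenness_report(
  pos        = c(10, 40, 60, 90),
  lineage    = c("a", "b", "c", "d"),
  full_range = c(0, 100)
)
```

With the real objects, the call would presumably be evenness_report(samples$V2, samples$V4, range(tb_profiler_barcodes$V2)).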
