How to sample evenly from a data frame having only one representative per group?

Question

I have a data frame and want to subsample column V2 (position) evenly (min: 1130, max: 4406748) so that each value of column V4 (lineage) appears only once in the final sample. In other words, the sampled positions should be evenly distributed across the range, with exactly one representative of each group in the entire sample.

I have tried sorting and binning the data, but I cannot figure out how to sample evenly from the bins so that only one representative of each lineage is present in the resulting data frame.

library(dplyr)

# sort by position
sorted_barcodes <- tb_profiler_barcodes %>% arrange(V2)
# bin the sorted data into 150 equal-width bins
binned_sorted <- sorted_barcodes %>%
  mutate(bin = cut(V2, breaks = 150, labels = FALSE))

I would appreciate your help.

Edward · Accepted Answer · 2024-08-07 03:34:33Z

I think you can't satisfy both conditions (unique V4 and evenly distributed V2) perfectly, unless you take a really small sample. If you relax the "evenly distributed" condition slightly, you can ensure uniqueness for V4. One way to do that is to use the strata function from the sampling package. This function allows you to perform stratified sampling, where you can specify that only 1 of each value in V4 be included. The distribution of V2 should (theoretically) be even, although with random sampling you may get a bad sample.

tb_profiler_barcodes <- read.table("tbdb.barcode.bed", sep="\t")

library(sampling)
library(dplyr)

sorted_barcodes <- tb_profiler_barcodes %>% arrange(V2)

size <- length(unique(tb_profiler_barcodes$V4)) # number of strata
n <- nrow(tb_profiler_barcodes)

set.seed(123) # Omit in practice
s <- strata(sorted_barcodes, 
            stratanames="V4",
            size=rep(1, size),  # select only 1 from each stratum
            method="srswor")

sample <- getdata(sorted_barcodes, s)

Check the uniqueness of V4:

any(duplicated(sample$V4))  # [1] FALSE

Check the distribution of V2:

plot(sorted_barcodes$V2, rep(1, n), pch=19, 
     xlab="", ylab="", yaxt="n", xaxt="n", col="red")
points(sample$V2, rep(0.8, size), pch=20, col="blue")
legend("topleft", legend=c("Data", "Sample"), 
       col=c("red", "blue"),
       pch=c(19,20), bty="n")

If the distribution doesn't look even enough to you, then repeat the sampling (without the seed) until you get one that looks better.
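The "repeat until it looks better" step can also be automated by scoring each draw. The following is a minimal sketch, not part of the sampling package: it assumes unevenness can be measured as the standard deviation of the gaps between sorted sampled positions, and it uses made-up toy data plus two hypothetical helpers, `draw_one_per_group` and `gap_sd`.

```r
# Toy stand-in for tb_profiler_barcodes (made-up data):
# V2 = position, V4 = lineage, 20 lineages with 15 rows each
set.seed(1)
toy <- data.frame(
  V2 = sort(sample(1130:4406748, 300)),
  V4 = sample(rep(paste0("lineage", 1:20), 15))
)

# Hypothetical helper: draw one random row per lineage (base R only)
draw_one_per_group <- function(df) {
  rows <- split(seq_len(nrow(df)), df$V4)
  picked <- vapply(rows, function(i) i[sample.int(length(i), 1)], integer(1))
  df[unname(picked), ]
}

# Hypothetical unevenness score: SD of gaps between sorted positions
gap_sd <- function(pos) sd(diff(sort(pos)))

# Draw many candidate samples and keep the most evenly spaced one
candidates <- replicate(200, draw_one_per_group(toy), simplify = FALSE)
scores <- vapply(candidates, function(s) gap_sd(s$V2), numeric(1))
best <- candidates[[which.min(scores)]]
```

With the real data, `toy` would be replaced by `tb_profiler_barcodes`. The number of candidate draws (200 here) is arbitrary; more draws give a better chance of an even sample at the cost of run time.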

jblood94 · Accepted Answer · 2024-08-07 13:20:49Z

This could be approached as an assignment problem, with the cost equal to the distance from an "ideal" distribution of V2 (position) values.

First get the ideal spacing.

r <- range(tb_profiler_barcodes$V2)
n <- length(unique(tb_profiler_barcodes$V4))
ideal <- seq(0.5, n - 0.5)*diff(r)/n + r[1] # ideal "even" spacing
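As a quick sanity check of the formula, a toy range of 0 to 100 with 4 groups should place the ideal points at the midpoints of four equal-width bins. The values below are illustrative and not taken from the data.

```r
r_toy <- c(0, 100)  # toy range of positions
n_toy <- 4          # toy number of lineages

# Same formula as above: midpoints of n_toy equal-width bins over the range
ideal_toy <- seq(0.5, n_toy - 0.5) * diff(r_toy) / n_toy + r_toy[1]
ideal_toy
# [1] 12.5 37.5 62.5 87.5
```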

Get the distance between each value in V2 and the ideal sample locations.

d <- outer(tb_profiler_barcodes$V2, ideal, \(x, y) abs(x - y))

For each value in V4 (lineage), get the best candidate for each ideal location (my go-to is data.table for group operations). This is the row number of the column minimum by lineage.

library(data.table)

idx <- as.matrix(
  cbind(data.table(lineage = tb_profiler_barcodes$V4)[,ID := .I], d)[
          ,lapply(.SD, \(x) ID[which.min(x)]), lineage, .SDcols = 3:(n + 2)
        ][,lineage := NULL]
)

Get the distance for each row index.

mindists <- idx
mindists[] <- d[cbind(c(idx), c(col(idx)))]
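The last line relies on R's subsetting of a matrix by a two-column matrix, where each (row, column) pair pulls out one element. A small illustration with made-up values, using `m` in place of `d` and `idx_toy` in place of `idx`:

```r
m <- matrix(1:12, nrow = 3)  # 3 x 4 matrix, filled column-wise

# Each row of `pairs` is a (row, col) coordinate into m
pairs <- cbind(c(1, 3, 2), c(1, 2, 4))
m[pairs]
# [1]  1  6 11

# Same pattern as above: idx_toy holds row numbers of m, one per (group, column)
idx_toy <- matrix(c(1, 3, 2, 2, 3, 1, 1, 2), nrow = 2)
out <- idx_toy
out[] <- m[cbind(c(idx_toy), c(col(idx_toy)))]  # out[i, j] is m[idx_toy[i, j], j]
out
#      [,1] [,2] [,3] [,4]
# [1,]    1    5    9   10
# [2,]    3    5    7   11
```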

Solve the assignment problem and take the samples.

samples <- tb_profiler_barcodes[
  idx[RcppHungarian::HungarianSolver(mindists)$pairs],
]
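To illustrate what the assignment step is solving, here is a brute-force sketch on a made-up 3 x 3 cost matrix (rows are lineages, columns are ideal locations). It enumerates every one-to-one assignment in base R. This is only feasible for tiny problems; the Hungarian algorithm reaches the same minimum total cost efficiently.

```r
# Made-up costs: rows 1 and 2 both prefer column 1, so a greedy
# "take each row's minimum" approach would produce a conflict
cost <- matrix(c(1, 2, 9,
                 1, 8, 9,
                 5, 6, 3), nrow = 3, byrow = TRUE)

# All 6 ways to assign rows 1..3 to distinct columns
perms <- rbind(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3),
               c(2, 3, 1), c(3, 1, 2), c(3, 2, 1))

# Total cost of each assignment
totals <- apply(perms, 1, function(p) sum(cost[cbind(1:3, p)]))
best_perm <- perms[which.min(totals), ]
best_perm    # column assigned to rows 1, 2, 3
# [1] 2 1 3
min(totals)
# [1] 6
```

Row 1 gives up its preferred column so that row 2 can take it, which lowers the total cost. This is the trade-off the solver makes across all lineages at once.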

You can check the distribution with, e.g., plot(sort(samples$V2)), which will show the points are nearly linear.
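A numerical complement to the plot, sketched under the assumption that evenness is judged by the gaps between consecutive sorted positions. The helper `gap_cv` and the toy vectors are hypothetical; a coefficient of variation near 0 indicates nearly equal spacing.

```r
# Hypothetical helper: coefficient of variation of gaps between sorted positions
gap_cv <- function(pos) {
  gaps <- diff(sort(pos))
  sd(gaps) / mean(gaps)
}

# Toy comparison: perfectly even positions vs. positions clumped near the start
even_pos <- seq(1130, 4406748, length.out = 50)
set.seed(42)
clumped_pos <- c(runif(40, 1130, 500000), runif(10, 500000, 4406748))

gap_cv(even_pos)     # essentially 0
gap_cv(clumped_pos)  # noticeably larger
```

Applied to the result above, `gap_cv(samples$V2)` gives a single number that can be compared across sampling approaches.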
