
My simplified Dataset looks like:

import torch
from torch.utils.data import Dataset, DataLoader
from typing import List

class MyDataset(Dataset):
    def __init__(self) -> None:
        super().__init__()
        self.images: torch.Tensor       # shape (n, w, h, c) -- n images in memory, specific use case
        self.labels: torch.Tensor       # shape (n, w, h, c) -- n labels in memory, specific use case
        self.positive_idx: List[int]    # roughly 1 positive per 10000 negatives
        self.negative_idx: List[int]

    def __len__(self):
        return 10000  # fixed value for training

    def __getitem__(self, idx):
        return self.images[idx], self.labels[idx]
    

ds = MyDataset()
dl = DataLoader(ds, batch_size=100, shuffle=False, sampler=...)
# WeightedRandomSampler? shuffle=False because I guess the sampler should handle the shuffling.

What is the most "torch" way of balancing the sampling for the DataLoader, so that in each epoch every batch is constructed as 10 positive + 90 random negative samples, duplicating positives when there are not enough of them?

For the purpose of this exercise I'm not implementing augmentation to increase the sample size of positives.
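For context, the closest built-in is `torch.utils.data.WeightedRandomSampler`, which balances classes in expectation (about 10 positives per batch of 100) rather than guaranteeing an exact 10/90 split. A minimal sketch, assuming a hypothetical pool of 5 positives out of 10000 samples:

```python
import torch
from torch.utils.data import WeightedRandomSampler

n_total = 10000
positive_idx = list(range(5))  # hypothetical: only 5 positives

# give positives 10% of the total probability mass, negatives the remaining 90%
weights = torch.full((n_total,), 90.0 / (n_total - len(positive_idx)))
weights[positive_idx] = 10.0 / len(positive_idx)

# replacement=True lets the scarce positives be drawn repeatedly within one epoch
sampler = WeightedRandomSampler(weights, num_samples=n_total, replacement=True)
# dl = DataLoader(ds, batch_size=100, sampler=sampler)  # shuffle stays False
```

If the 10/90 split must hold exactly in every batch, a custom batch sampler like the one in the answer below is the way to go.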

  • Could you down-sample the negative samples? If it's about triplet loss or contrastive loss, I would use hard negative mining. Commented Mar 1, 2024 at 9:52
  • Hard negative mining would be kind of a workaround, but I'm sure it's easier to do it before training by specifying the pool (positive/negative) from which samples are selected. Commented Mar 2, 2024 at 11:11
  • And down-sampling is of course a way to do it. I just wanted to do it with a torch Sampler instead of iterating in a smart way over the indexes of the positive and negative lists. Commented Mar 2, 2024 at 14:00

1 Answer

I think you can implement a batch sampler to choose which data points will be yielded by your dataset's ```__getitem__```:

import random

class NegativeSampler:

    def __init__(self, positive_idx, negative_idx, n_batch=100, n_pos=10, n_neg=90):
        self.positive_idx = positive_idx
        self.negative_idx = negative_idx
        self.n_batch = n_batch  # number of batches per epoch
        self.n_pos = n_pos      # positives per batch
        self.n_neg = n_neg      # negatives per batch

    def __iter__(self):
        # yields one list of indices per batch; DataLoader passes them to ```__getitem__(self, idx)```
        for _ in range(self.n_batch):
            if len(self.positive_idx) >= self.n_pos:
                positive_idx_batch = random.sample(self.positive_idx, self.n_pos)
            else:
                # not enough positives: duplicate by sampling with replacement
                positive_idx_batch = random.choices(self.positive_idx, k=self.n_pos)
            negative_idx_batch = random.sample(self.negative_idx, self.n_neg)
            yield positive_idx_batch + negative_idx_batch

    def __len__(self):
        return self.n_batch
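Wired into a `DataLoader` via its `batch_sampler` argument, the 10/90 split holds exactly in every batch. A self-contained sketch (the sampler is repeated from above, and the toy dataset and index pools are assumptions for illustration):

```python
import random
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Stand-in dataset: item i is just the index i, so batches are easy to inspect."""
    def __len__(self):
        return 10000
    def __getitem__(self, idx):
        return idx

class NegativeSampler:
    """Batch sampler from above, repeated so this snippet runs on its own."""
    def __init__(self, positive_idx, negative_idx, n_batch=100, n_pos=10, n_neg=90):
        self.positive_idx = list(positive_idx)
        self.negative_idx = list(negative_idx)
        self.n_batch, self.n_pos, self.n_neg = n_batch, n_pos, n_neg

    def __iter__(self):
        for _ in range(self.n_batch):
            if len(self.positive_idx) >= self.n_pos:
                pos = random.sample(self.positive_idx, self.n_pos)
            else:
                # not enough positives: duplicate by sampling with replacement
                pos = random.choices(self.positive_idx, k=self.n_pos)
            yield pos + random.sample(self.negative_idx, self.n_neg)

    def __len__(self):
        return self.n_batch

positive_idx = list(range(5))          # hypothetical: only 5 positives
negative_idx = list(range(5, 10000))

sampler = NegativeSampler(positive_idx, negative_idx)
# batch_sampler yields complete index lists, so batch_size, shuffle and sampler stay unset
dl = DataLoader(ToyDataset(), batch_sampler=sampler)

batch = next(iter(dl))                 # tensor of 100 indices: 10 positive, 90 negative
```

Note that with `batch_sampler` the `DataLoader` must not also receive `batch_size`, `shuffle`, `sampler`, or `drop_last`; the batch sampler owns both the batching and the shuffling.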
