Batch Sampler does not seem to work

Greetings,

I would like to run experiments with varying batch sizes during model training. I am working with simple feed-forward networks at the moment.

I came across the Sampler class that one can pass to the DataLoader. I tried to use one of the examples I found online (code below), but when trying to compute the loss (BCE loss), I get the following error:

ValueError: Using a target size (torch.Size([1, 50])) that is different to the input size (torch.Size([50])) is deprecated. Please ensure they have the same size.

The same code, however, worked with the usual DataLoader and a constant batch size. Something about the dimensions goes wrong with my sampler. I would appreciate any help!

## Sampler Class
class VariableBatchSampler(torch.utils.data.Sampler):
    def __init__(self, dataset_len: int, batch_sizes: list):
        self.dataset_len = dataset_len
        self.batch_sizes = batch_sizes
        self.batch_idx = 0
        self.start_idx = 0
        self.end_idx = self.batch_sizes[self.batch_idx]
        
    def __iter__(self):
        return self
       
    def __next__(self):
        if self.start_idx >= self.dataset_len:
            raise StopIteration()
 
        batch_indices = torch.arange(self.start_idx, self.end_idx, dtype=torch.int)
        self.start_idx += (self.end_idx - self.start_idx)
        self.batch_idx += 1

        try:
            self.end_idx += self.batch_sizes[self.batch_idx]
        except IndexError:
            self.end_idx = self.dataset_len
             
        return batch_indices


## Initialization in main
train_set = torch.utils.data.TensorDataset(Xtrain, ytrain)

batch_sampler = VariableBatchSampler(len(Xtrain), [50,500])
train_loader = torch.utils.data.DataLoader(train_set, sampler=batch_sampler) # allows iterating

The error is raised in the training loop:

for data, target in data_loader:
    data, target = data.to(device), target.to(device)
    output = self.forward(data).squeeze()
    loss += criterion(output, target).data.item()
    ## and so on

Based on the error message it seems the shapes of your model output and target do not match, and thus the loss calculation is failing. I don't think the issue is necessarily related to the BatchSampler; you could unsqueeze() the output or permute the target, depending on what the actual batch size and feature dimension are.
I would assume you are using 50 samples and are working on binary classification.
If so, both the model output and target should have the shape [50, 1].
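For example (a minimal sketch with stand-in tensors, assuming a single output unit and nn.BCELoss):

import torch
import torch.nn as nn

criterion = nn.BCELoss()

output = torch.sigmoid(torch.randn(50))        # model output after squeeze(), shape [50]
target = torch.randint(0, 2, (1, 50)).float()  # target as reported in the error, shape [1, 50]

# criterion(output, target) would raise the size mismatch; align both to [50, 1] first
loss = criterion(output.unsqueeze(1), target.permute(1, 0))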


Thanks ptrblck!

It is indeed binary classification and the batch size is 50 in the example.

Changing the line

output = self.forward(data).squeeze()

to

output = self.forward(data).reshape(target.shape)

solves the issue and also works in the case with constant batch size.
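A quick check with stand-in tensors (assuming the shapes from my setup) shows why reshape works in both cases:

import torch

out = torch.randn(50, 1)                                 # stand-in for the raw model output

target_sampler = torch.randint(0, 2, (1, 50)).float()    # target shape with the sampler
target_plain = torch.randint(0, 2, (50,)).float()        # target shape with a plain DataLoader

print(out.squeeze().shape)                       # torch.Size([50])    -> mismatches [1, 50]
print(out.reshape(target_sampler.shape).shape)   # torch.Size([1, 50]) -> matches the sampler target
print(out.reshape(target_plain.shape).shape)     # torch.Size([50])    -> matches the plain target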

I do get another error with the sampler, though:


  File ~/path_to_file/code.py:582 in train_func
    (data, target) = next(iter(self.train_loader))          # compute initial gradients

  File ~/miniconda3/envs/PYTORCH/lib/python3.11/site-packages/torch/utils/data/dataloader.py:631 in __next__
    data = self._next_data()

  File ~/miniconda3/envs/PYTORCH/lib/python3.11/site-packages/torch/utils/data/dataloader.py:674 in _next_data
    index = self._next_index()  # may raise StopIteration

  File ~/miniconda3/envs/PYTORCH/lib/python3.11/site-packages/torch/utils/data/dataloader.py:621 in _next_index
    return next(self._sampler_iter)  # may raise StopIteration

What I try to do in train_func is to obtain the first gradient manually. I need it to initialize a helper variable. Once again, it works for normal DataLoaders, but not when passing the sampler argument.
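Roughly, the relevant part looks like this (a simplified stand-in for my actual model, loss, and data):

import torch
import torch.nn as nn

# stand-ins for the actual model, loss, and training data
model = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())
criterion = nn.BCELoss()
Xtrain = torch.randn(500, 2)
ytrain = torch.zeros(500)
ytrain[:250] = 1
train_set = torch.utils.data.TensorDataset(Xtrain, ytrain)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=50)

# fetch a single batch up front and backpropagate once to get the initial gradients
data, target = next(iter(train_loader))
output = model(data).reshape(target.shape)
loss = criterion(output, target)
loss.backward()

# copy the gradients into the helper variable
init_grads = [p.grad.detach().clone() for p in model.parameters()]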

Could you post the error message you are seeing using your BatchSampler?

I created a small example. I think I am misunderstanding something about what the sampler does.

import torch
import torch.nn as nn


# for variable batch sizes
class VariableBatchSampler(torch.utils.data.Sampler):
    def __init__(self, dataset_len: int, batch_sizes: list):
        self.dataset_len = dataset_len
        self.batch_sizes = batch_sizes
        self.batch_idx = 0
        self.start_idx = 0
        self.end_idx = self.batch_sizes[self.batch_idx]
        
    def __iter__(self):
        return self
       
    def __next__(self):
        if self.start_idx >= self.dataset_len:
            raise StopIteration()
 
        batch_indices = torch.arange(self.start_idx, self.end_idx, dtype=torch.int)
        self.start_idx += (self.end_idx - self.start_idx)
        self.batch_idx += 1

        try:
            self.end_idx += self.batch_sizes[self.batch_idx]
        except IndexError:
            self.end_idx = self.dataset_len
             
        return batch_indices


## define some random data set
Npoints = 500
Xtrain = torch.normal(mean=torch.zeros((Npoints,2)))
Ytrain = torch.zeros(Npoints)
Ytrain[0:Npoints//2] = 1

train_set = torch.utils.data.TensorDataset(Xtrain, Ytrain) # wrap the tensors in a TensorDataset


batch_sampler = VariableBatchSampler(len(Xtrain), [50,100])
train_loader = torch.utils.data.DataLoader(train_set, sampler=batch_sampler) 

(data, target) = next(iter(train_loader))  
print(data.size())                              # prints torch.Size([1, 50, 2])


(data, target) = next(iter(train_loader))
print(data.size())                              # prints torch.Size([1, 100, 2])


(data, target) = next(iter(train_loader)) 
print(data.size())                              # prints torch.Size([1, 350, 2]), why?


(data, target) = next(iter(train_loader)) 
print(data.size())                              # raises stop iteration error

I expected it to give me batches of size 50, 100, 50, 100, and so on. Yet one of the batches has size 350. Also, once the end of the dataset is reached, asking for a new batch raises the StopIteration error. I would like it to simply keep giving batches, as the DataLoader does without the sampler.
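To illustrate what I mean, this is roughly the behavior I was hoping for (just an untested sketch of the intent, not my actual code):

import itertools
import torch

class CyclingBatchSampler(torch.utils.data.Sampler):
    # sketch: cycle through batch_sizes within an epoch and start from the
    # beginning of the dataset whenever __iter__ is called again
    def __init__(self, dataset_len: int, batch_sizes: list):
        self.dataset_len = dataset_len
        self.batch_sizes = batch_sizes

    def __iter__(self):
        start = 0
        for bs in itertools.cycle(self.batch_sizes):
            if start >= self.dataset_len:
                return
            end = min(start + bs, self.dataset_len)
            yield list(range(start, end))
            start = end

# presumably used via the batch_sampler argument, e.g.
# loader = torch.utils.data.DataLoader(train_set, batch_sampler=CyclingBatchSampler(len(Xtrain), [50, 100]))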