DataLoader for variable batch size

Hi, I am new to this, and for most applications I have been using the DataLoader in torch.utils.data to load batches of images. However, I now want to load images with a different batch size at each iteration. For example, my first iteration would load a batch of 10 and the second a batch of 20.

Is there a way to do this easily? Thank you.

Hi,
Same problem here.
Did you manage to do it?

You could implement a custom collate_fn for your DataLoader and use it to load your batches.

I think the easiest way to achieve this is to change the batch_size parameter of the DataLoader.
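For example, a minimal sketch of that idea (with a toy TensorDataset and a hypothetical batch_size_schedule; recreating the loader whenever the batch size should change):

import torch
from torch.utils.data import DataLoader, TensorDataset

# toy dataset standing in for your image dataset
dataset = TensorDataset(torch.randn(100, 3, 32, 32), torch.randint(0, 10, (100,)))

# hypothetical schedule: one batch size per training phase
batch_size_schedule = [10, 20, 40]

for bs in batch_size_schedule:
    # recreate the DataLoader whenever the batch size should change
    loader = DataLoader(dataset, batch_size=bs, shuffle=True)
    for images, targets in loader:
        pass  # training step with batches of size bs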

Thank you very much for your answers!!
I actually found what I wanted with the sampler in this discussion: 405015099. I change the batch size by giving each source its own batch_size (here my data_source is a concatenation of datasets, each with its own batch_size).
Not very clean, but it seems to work.

import random

from torch.utils.data import Sampler


class ClusterRandomSampler(Sampler):
    r"""Takes a dataset with a cluster_indices property and cuts it into batch-sized chunks.
    Drops the extra items that do not fit into exact batches.

    Arguments:
        data_source (Dataset): a Dataset to sample from. Should have a cluster_indices property
        batch_size (int): a batch size that you would like to use later with the DataLoader class
        shuffle (bool): whether to shuffle the data or not
    """

    def __init__(self, data_source, batch_size=None, shuffle=True):
        self.data_source = data_source
        if batch_size is not None:
            assert self.data_source.batch_sizes is None, "do not declare batch size in sampler " \
                                                         "if data source already got one"
            self.batch_sizes = [batch_size for _ in self.data_source.cluster_indices]
        else:
            self.batch_sizes = self.data_source.batch_sizes
        self.shuffle = shuffle

    def flatten_list(self, lst):
        return [item for sublist in lst for item in sublist]

    def __iter__(self):
        batch_lists = []
        for j, cluster_indices in enumerate(self.data_source.cluster_indices):
            # cut each cluster into batches using that cluster's batch size
            batches = [
                cluster_indices[i:i + self.batch_sizes[j]] for i in range(0, len(cluster_indices), self.batch_sizes[j])
            ]
            # filter out the shorter (incomplete) batches
            batches = [_ for _ in batches if len(_) == self.batch_sizes[j]]
            if self.shuffle:
                random.shuffle(batches)
            batch_lists.append(batches)

        # flatten the per-cluster lists and shuffle the batches if necessary
        # this works on the batch level
        lst = self.flatten_list(batch_lists)
        if self.shuffle:
            random.shuffle(lst)
        return iter(lst)

    def __len__(self):
        return len(self.data_source)
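Since __iter__ yields whole lists of indices rather than single indices, one way to wire this up (my assumption; the original thread may have used it slightly differently) is to pass the sampler as batch_sampler, with the cluster_indices and batch_sizes attributes attached to the dataset:

import torch
from torch.utils.data import DataLoader, TensorDataset

data = torch.randn(30, 3, 32, 32)
targets = torch.randint(0, 10, (30,))
dataset = TensorDataset(data, targets)

# two clusters with their own batch sizes: 10 for the first, 5 for the second
dataset.cluster_indices = [list(range(0, 20)), list(range(20, 30))]
dataset.batch_sizes = [10, 5]

sampler = ClusterRandomSampler(dataset)  # picks up dataset.batch_sizes
loader = DataLoader(dataset, batch_sampler=sampler)

for images, labels in loader:
    print(images.shape[0])  # 10 or 5 depending on the cluster the batch came from

Note that len(loader) will report the number of samples here (because of __len__ above), even though iteration yields one batch per list of indices.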

I have been trying to use collate_fn for this purpose but haven’t figured out how yet. Can you give any pointers? My problem right now is that the sampler gives collate_fn 16 samples at a time, but I want the batch size to be 128. Is this possible with this approach?

The collate_fn is used to process the batch of samples in a custom way. It doesn’t specify the batch size, which is set in the DataLoader.
Could you explain your issue a bit more, i.e. are you setting a batch size of 128 in the DataLoader and each batch contains just 16 samples?
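To illustrate with a minimal sketch (toy dataset and a hypothetical my_collate): the collate_fn only receives the list of samples that the DataLoader has already grouped according to batch_size, so it cannot change how many samples are in that list:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))

def my_collate(samples):
    # `samples` is a list whose length equals the DataLoader's batch_size
    images = torch.stack([img for img, _ in samples])
    labels = torch.stack([lbl for _, lbl in samples])
    return images, labels

loader = DataLoader(dataset, batch_size=16, collate_fn=my_collate)
images, labels = next(iter(loader))
print(images.shape)  # torch.Size([16, 3, 32, 32])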

I’m trying to replicate the original StyleGAN’s batch size schedule: 128, 128, 128, 64, 32, 16 as progressive growing is applied. I know I can recreate the DataLoader when I want to switch, but I’m working inside an existing framework where that is a clunky change to make.

I never did figure out how to use collate_fn here, so instead I’m initializing my DataLoader with a batch_size of 16, and in my training loop I collect and concatenate these batches until I reach the actual batch size I want at any given time. This only works because all the batch sizes are divisible by 16. I tried to do this in collate_fn at first; I thought maybe it received a generator and I could return a different generator, but that wasn’t the case.

I’m still interested to know how collate_fn can be used to yield variable batch sizes, maybe it would be cleaner than my solution.
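For reference, a minimal sketch of the accumulation workaround described above (assuming a toy dataset and that the target size is always a multiple of the loader’s batch_size of 16):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(512, 3, 32, 32), torch.randint(0, 10, (512,)))
loader = DataLoader(dataset, batch_size=16, shuffle=True)

target_batch_size = 128  # e.g. the current step of the StyleGAN schedule

chunks = []
for images, labels in loader:
    chunks.append((images, labels))
    if sum(c[0].size(0) for c in chunks) < target_batch_size:
        continue
    # concatenate the 16-sample chunks into one large batch
    big_images = torch.cat([c[0] for c in chunks])
    big_labels = torch.cat([c[1] for c in chunks])
    chunks = []
    # ... training step with a batch of `target_batch_size` samples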

You could use this code snippet to see an example.
Note that the “variable size” is usually the temporal dimension or the spatial dimensions (e.g. images with a different resolution), not the batch size.
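As a small illustration of that kind of “variable size” (my own sketch, using a toy list of variable-length sequences), a collate_fn can pad the samples inside a fixed-size batch:

import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

# toy dataset of variable-length sequences with 8 features each
sequences = [torch.randn(torch.randint(5, 20, (1,)).item(), 8) for _ in range(100)]

def pad_collate(batch):
    lengths = torch.tensor([seq.size(0) for seq in batch])
    # pad all sequences in the batch to the length of the longest one
    padded = pad_sequence(batch, batch_first=True)
    return padded, lengths

loader = DataLoader(sequences, batch_size=4, collate_fn=pad_collate)
padded, lengths = next(iter(loader))
print(padded.shape, lengths)  # e.g. torch.Size([4, 19, 8]) plus the original lengths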

That snippet again does not modify the batch size, which is the subject of this thread.

I’ve come across the same issue while trying to implement this functionality of StyleGAN using PyTorch Lightning, which I believe is similar to your use case. Any luck on your end resolving this issue?

Any news? How could one change the batch size of the DataLoader during training?


I also wish there were a way to do this, e.g. making the batch size depend on the size of the input samples in the case of variable-length data such as sentences.
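One sketch of that idea (names like TokenBudgetBatchSampler and max_tokens are my own): a custom batch sampler that groups samples under a token budget, so longer sentences end up in smaller batches, passed to the DataLoader via batch_sampler:

import torch
from torch.utils.data import DataLoader, Sampler
from torch.nn.utils.rnn import pad_sequence

class TokenBudgetBatchSampler(Sampler):
    """Yields variable-sized batches whose total length stays under `max_tokens`."""
    def __init__(self, lengths, max_tokens):
        self.lengths = lengths
        self.max_tokens = max_tokens

    def __iter__(self):
        batch, budget = [], 0
        for idx, length in enumerate(self.lengths):
            if batch and budget + length > self.max_tokens:
                yield batch
                batch, budget = [], 0
            batch.append(idx)
            budget += length
        if batch:
            yield batch

    def __len__(self):
        # rough upper bound on the number of batches
        return len(self.lengths)

# toy variable-length "sentences" of token ids
sentences = [torch.randint(0, 1000, (torch.randint(3, 40, (1,)).item(),)) for _ in range(200)]
lengths = [len(s) for s in sentences]

loader = DataLoader(sentences,
                    batch_sampler=TokenBudgetBatchSampler(lengths, max_tokens=100),
                    collate_fn=lambda batch: pad_sequence(batch, batch_first=True))

for padded in loader:
    print(padded.shape)  # the batch size varies with the sentence lengths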