Dataloader for variable batch size

hij · February 20, 2018, 8:32am

Hi, I am new to this. For most applications I have been using the DataLoader in torch.utils.data to load batches of images. However, I am now trying to load images with a different batch size at each iteration. For example, my first iteration loads a batch of 10, and the second loads a batch of 20.

Is there a way to do this easily? Thank you.

turpaultn · July 30, 2018, 12:35pm

Hi,
Same problem here.
Did you manage to do it?

ptrblck · July 30, 2018, 12:40pm

You could implement a custom collate_fn for your DataLoader and use it to load your batches.

justusschock · July 30, 2018, 12:40pm

I think the easiest way to achieve this is to change the batch_size parameter of the DataLoader.
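As far as I know, batch_size is fixed when the DataLoader is constructed, so in practice "changing the parameter" means building a new DataLoader whenever the batch size should change. A minimal sketch of that idea follows; the toy dataset and the `batch_size_per_epoch` schedule are made up for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset of 60 samples standing in for an image dataset (assumption).
dataset = TensorDataset(torch.arange(60).float().unsqueeze(1))

# Hypothetical schedule: one batch size per epoch.
batch_size_per_epoch = [10, 20]

seen_batch_sizes = []
for epoch, batch_size in enumerate(batch_size_per_epoch):
    # Build a fresh loader for this epoch; the dataset object itself is reused.
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    for (x,) in loader:
        # A real training step would go here.
        seen_batch_sizes.append(x.shape[0])

print(seen_batch_sizes)
```

This only changes the batch size between passes over the data, not from one iteration to the next within a pass.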

turpaultn · July 30, 2018, 1:11pm

Thank you very much for your answers!
I actually found what I wanted with the sampler in this discussion: 405015099. I changed it to use a separate batch size for each source (here my data_source is a concatenation of datasets, each with its own batch_size).
Not very clean, but it seems to work.

import random

from torch.utils.data import Sampler


class ClusterRandomSampler(Sampler):
    r"""Takes a dataset with a cluster_indices property and cuts each cluster into batch-sized chunks.
    Drops the extra items that do not fit into exact batches.
    Arguments:
        data_source (Dataset): a Dataset to sample from. Should have a cluster_indices property
            and a batch_sizes property (one batch size per cluster, or None)
        batch_size (int): a single batch size for all clusters; only allowed when
            data_source.batch_sizes is None
        shuffle (bool): whether to shuffle the data or not
    """

    def __init__(self, data_source, batch_size=None, shuffle=True):
        self.data_source = data_source
        if batch_size is not None:
            assert self.data_source.batch_sizes is None, "do not declare batch size in sampler " \
                                                         "if data source already got one"
            self.batch_sizes = [batch_size for _ in self.data_source.cluster_indices]
        else:
            self.batch_sizes = self.data_source.batch_sizes
        self.shuffle = shuffle

    def flatten_list(self, lst):
        return [item for sublist in lst for item in sublist]

    def __iter__(self):

        batch_lists = []
        for j, cluster_indices in enumerate(self.data_source.cluster_indices):
            batches = [
                cluster_indices[i:i + self.batch_sizes[j]] for i in range(0, len(cluster_indices), self.batch_sizes[j])
            ]
            # filter out the shorter (incomplete) batches
            batches = [_ for _ in batches if len(_) == self.batch_sizes[j]]
            if self.shuffle:
                random.shuffle(batches)
            batch_lists.append(batches)

        # flatten the per-cluster lists into one list of batches, then
        # shuffle if necessary (this works at the batch level)
        lst = self.flatten_list(batch_lists)
        if self.shuffle:
            random.shuffle(lst)
        return iter(lst)

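    def __len__(self):
        # number of full batches yielded by __iter__, not the number of samples
        return sum(
            len(cluster_indices) // batch_size
            for cluster_indices, batch_size in zip(self.data_source.cluster_indices, self.batch_sizes)
        )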
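Since the sampler above yields whole batches (lists of indices) rather than single indices, it would presumably be passed to the DataLoader through the batch_sampler argument instead of sampler. The sketch below shows just that mechanism with a hand-written list of index lists in place of the sampler; the toy dataset and the `batches` list are assumptions for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset of 30 samples (assumption).
dataset = TensorDataset(torch.arange(30))

# batch_sampler accepts any iterable that yields lists of indices,
# so each batch can have its own size: here 10 samples, then 20.
batches = [list(range(0, 10)), list(range(10, 30))]

# batch_size, shuffle, sampler and drop_last must be left at their
# defaults when batch_sampler is given.
loader = DataLoader(dataset, batch_sampler=batches)

sizes = [x.shape[0] for (x,) in loader]
print(sizes)
```

An instance of a batch-yielding sampler class can be passed in the same way, e.g. `DataLoader(dataset, batch_sampler=my_sampler)`.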

R0dluvan · October 11, 2020, 10:45am

I have been trying to use collate_fn for this purpose but haven’t figured out how yet. Can you give any pointers? My problem right now is that the sampler gives collate_fn 16 samples at a time, but I want the batch size to be 128. Is this possible with this approach?

ptrblck · October 12, 2020, 12:24am

The collate_fn is used to process the batch of samples in a custom way. It doesn’t specify the batch size, which is set in the DataLoader.
Could you explain your issue a bit more, i.e. are you setting a batch size of 128 in the DataLoader and each batch contains just 16 samples?
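To make the division of labour concrete, here is a small sketch that records how many samples collate_fn receives per call. `ToyDataset` and `collate` are hypothetical names used only for this illustration.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Hypothetical dataset of 64 one-element tensors."""

    def __len__(self):
        return 64

    def __getitem__(self, idx):
        return torch.tensor([float(idx)])


received_lengths = []


def collate(samples):
    # `samples` is a list of dataset items; its length is decided by the
    # DataLoader (batch_size or batch_sampler), not by this function.
    received_lengths.append(len(samples))
    return torch.stack(samples)


loader = DataLoader(ToyDataset(), batch_size=16, collate_fn=collate)
shapes = [tuple(batch.shape) for batch in loader]

print(received_lengths)
print(shapes)
```

collate_fn can drop or reshape samples inside the list it is given, but it never sees more than one batch at a time, so it cannot merge several batches into a larger one.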

R0dluvan · October 12, 2020, 6:29am

I’m trying to replicate the original StyleGAN’s batch size schedule: 128, 128, 128, 64, 32, 16 as the progressive growing is applied. I know I can recreate the DataLoader when I want to switch, but I’m working inside an existing framework that makes that change clunky.

I never did figure out how to use collate_fn here, so instead I’m initializing my DataLoader with a batch_size of 16, and in my training loop I collect and concatenate these batches until I reach the actual batch size I want at any given time. This only works because all the batch sizes are divisible by 16. I tried to do this in collate_fn at first, thinking that maybe it received a generator and I could return a different generator, but that wasn’t the case.
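For anyone wanting to try the same workaround, here is a rough sketch of the accumulation step as a generator. The `regroup` helper, the toy data and the shortened schedule are my own illustration, not code from the framework mentioned above, and it assumes every target size is a multiple of the loader's batch size.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def regroup(chunks, batch_size_schedule):
    """Concatenate small fixed-size chunks into batches of the scheduled sizes.

    Assumes each target size is a multiple of the chunk size; a trailing
    partial batch is dropped when the chunks run out.
    """
    chunk_iter = iter(chunks)
    for target in batch_size_schedule:
        parts, count = [], 0
        while count < target:
            try:
                chunk = next(chunk_iter)
            except StopIteration:
                return  # not enough data left for a full batch
            parts.append(chunk)
            count += chunk.shape[0]
        yield torch.cat(parts, dim=0)


# Toy data: 352 samples, loaded 16 at a time (assumption).
data = torch.arange(352).unsqueeze(1)
loader = DataLoader(TensorDataset(data), batch_size=16)

# Shortened, hypothetical schedule for the demo.
schedule = [128, 128, 64, 32]
batches = list(regroup((x for (x,) in loader), schedule))
sizes = [b.shape[0] for b in batches]
print(sizes)
```

In a real training loop the schedule would be driven by the current training phase rather than a fixed list.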

I’m still interested to know how collate_fn can be used to yield variable batch sizes; maybe it would be cleaner than my solution.

ptrblck · October 12, 2020, 6:34am

You could use this code snippet to see an example.
Note that the “variable size” is usually the temporal dimension or the spatial dimensions (e.g. images with a different resolution), not the batch size.

R0dluvan · October 16, 2020, 7:43pm

That snippet again does not modify the batch size, which is the subject of this thread.

atherfawaz · November 30, 2020, 7:59pm

I’ve come across the same issue while trying to implement this functionality of StyleGAN using PyTorch Lightning, which I believe is like your use case. Any luck on your end in resolving this issue?

gengala · November 27, 2022, 7:40pm

Any news? How could one change the batch size of the DataLoader during training?

FarzanT · May 23, 2023, 10:23pm

I also wish there was a way to do this, e.g. make the batch size depend on the size of the input samples in the case of variable-length input data, such as sentences, etc.
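One possible way to get there is a custom batch sampler that packs indices under a size budget, so short samples end up in large batches and long samples in small ones. The sketch below is only an illustration: `TokenBudgetBatchSampler`, `lengths` and `max_tokens` are hypothetical names, and it is plain Python with no shuffling.

```python
class TokenBudgetBatchSampler:
    """Yield lists of indices whose padded size stays within a budget.

    The padded size of a batch is (number of samples) * (longest sample).
    A single sample longer than the budget still gets its own batch.
    """

    def __init__(self, lengths, max_tokens):
        self.lengths = list(lengths)
        self.max_tokens = max_tokens
        self._batches = self._build()

    def _build(self):
        # Sort indices by length so similar-length samples share a batch.
        order = sorted(range(len(self.lengths)), key=lambda i: self.lengths[i])
        batches, current = [], []
        for idx in order:
            # Lengths are ascending, so the new sample would be the longest
            # one in the batch and determines the padded size.
            cost = (len(current) + 1) * self.lengths[idx]
            if current and cost > self.max_tokens:
                batches.append(current)
                current = []
            current.append(idx)
        if current:
            batches.append(current)
        return batches

    def __iter__(self):
        return iter(self._batches)

    def __len__(self):
        return len(self._batches)


# Hypothetical sentence lengths, indexed like the dataset.
lengths = [3, 10, 4, 9, 2, 5]
sampler = TokenBudgetBatchSampler(lengths, max_tokens=12)
print(list(sampler))
```

Because it is an iterable of index lists, an object like this can be handed to the DataLoader as `batch_sampler=sampler`, together with a collate_fn that pads each batch to its longest sample.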

drscotthawley · May 22, 2025, 12:40am

It looks like there’s a library that answers this call here: GitHub - ancestor-mithril/bs-scheduler: A Batch Size Scheduler library compatible with PyTorch DataLoaders.