Hi, I am new to this, and for most applications I have been using the DataLoader in utils.data to load batches of images. However, I am now trying to load images with a different batch size at each iteration. For example, my first iteration loads a batch of 10 and the second loads a batch of 20.
Thank you very much for your answers!!
I actually found what I wanted with the sampler from this discussion: 405015099. I changed it to use one batch_size per source (here my data_source is a concatenation of datasets, each with its own batch_size).
Not very clean but seems to work.
import random

from torch.utils.data import Sampler


class ClusterRandomSampler(Sampler):
    r"""Takes a dataset with cluster_indices property, cuts it into batch-sized chunks
    Drops the extra items, not fitting into exact batches
    Arguments:
        data_source (Dataset): a Dataset to sample from. Should have a cluster_indices property
            and, optionally, a batch_sizes property (one batch size per cluster)
        batch_size (int): a single batch size to use for every cluster
        shuffle (bool): whether to shuffle the data or not
    """

    def __init__(self, data_source, batch_size=None, shuffle=True):
        self.data_source = data_source
        if batch_size is not None:
            assert self.data_source.batch_sizes is None, \
                "do not declare batch size in sampler if data source already got one"
            self.batch_sizes = [batch_size for _ in self.data_source.cluster_indices]
        else:
            self.batch_sizes = self.data_source.batch_sizes
        self.shuffle = shuffle

    def flatten_list(self, lst):
        return [item for sublist in lst for item in sublist]

    def __iter__(self):
        batch_lists = []
        for j, cluster_indices in enumerate(self.data_source.cluster_indices):
            size = self.batch_sizes[j]
            batches = [cluster_indices[i:i + size] for i in range(0, len(cluster_indices), size)]
            # filter out the shorter batches
            batches = [batch for batch in batches if len(batch) == size]
            if self.shuffle:
                random.shuffle(batches)
            batch_lists.append(batches)
        # flatten the per-cluster lists into one list of batches and shuffle if necessary
        # this works on batch level
        lst = self.flatten_list(batch_lists)
        if self.shuffle:
            random.shuffle(lst)
        # each item is a whole batch (a list of indices), so pass this object
        # to the DataLoader as batch_sampler, not as sampler
        return iter(lst)

    def __len__(self):
        # number of batches per epoch (was len(self.data_source), which counts samples)
        return sum(len(c) // b for c, b in zip(self.data_source.cluster_indices, self.batch_sizes))
I have been trying to use collate_fn for this purpose but haven’t figured out how, yet. Can you give any pointers? My problem right now is the sampler gives collate_fn 16 samples at a time, but I want the batch size to be 128. Is this possible with this approach?
The collate_fn is used to process the batch of samples in a custom way. It doesn’t specify the batch size, which is set in the DataLoader.
Could you explain your issue a bit more, i.e. are you setting a batch size of 128 in the DataLoader and each batch contains just 16 samples?
I’m trying to replicate the original StyleGAN’s batch size schedule: 128, 128, 128, 64, 32, 16 as the progressive growing is applied. I know I can recreate the DataLoader when I want to switch, but I’m working inside an extant framework that makes that a clunky change to make.
I never did figure out how to use collate_fn here, so instead I’m initializing my DataLoader with a batch_size of 16, and in my training loop I collect and concatenate these batches until I reach the batch size I want at that point. This only works because all the batch sizes are divisible by 16. I tried to do this in collate_fn at first, thinking it might receive a generator so that I could return a different generator, but that wasn’t the case.
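For anyone wanting to try the same workaround, here is a rough sketch of the "collect small batches until the target size is reached" idea. It uses plain lists as stand-ins for tensors so it runs without a dataset; `regroup` and the toy `loader` are made-up names for illustration, and with real tensors you would collect the chunks and call `torch.cat(chunks, dim=0)` instead of extending a list.

```python
def regroup(small_batches, target_sizes):
    """Merge fixed-size batches into larger ones.

    small_batches: iterable of lists (stand-ins for the DataLoader's batches).
    target_sizes:  desired batch sizes, each assumed to be a multiple of the
                   small batch size (otherwise a merged batch overshoots).
    """
    small_batches = iter(small_batches)
    for target in target_sizes:
        merged = []
        while len(merged) < target:
            try:
                chunk = next(small_batches)
            except StopIteration:
                return  # data exhausted: the incomplete batch is dropped
            merged.extend(chunk)
        yield merged


# Toy data: 12 "batches" of 4 items each, standing in for DataLoader output.
loader = [list(range(i, i + 4)) for i in range(0, 48, 4)]
sizes = [len(batch) for batch in regroup(loader, [16, 16, 8, 8])]
print(sizes)  # [16, 16, 8, 8]
```

The small batch size has to divide every target size, which is the same restriction mentioned above.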
I’m still interested to know how collate_fn can be used to yield variable batch sizes. Maybe it would be cleaner than my solution.
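Since collate_fn only sees the samples it is handed, the DataLoader argument that actually decides how many samples go into each batch is `batch_sampler`, which accepts any iterable that yields lists of indices. Below is a minimal sketch of a sampler that follows a per-iteration batch-size schedule. `ScheduledBatchSampler` is a hypothetical helper, not something from torch.utils.data, and repeating the last size once the schedule runs out is just an assumption made for this example.

```python
import random


class ScheduledBatchSampler:
    """Yields lists of indices whose lengths follow a batch-size schedule.

    Hypothetical helper for illustration. Intended use (not executed here):
        DataLoader(dataset, batch_sampler=ScheduledBatchSampler(len(dataset), [10, 20]))
    """

    def __init__(self, num_samples, schedule, shuffle=True, seed=None):
        assert schedule and all(size > 0 for size in schedule)
        self.num_samples = num_samples
        self.schedule = list(schedule)
        self.shuffle = shuffle
        self.rng = random.Random(seed)

    def _sizes(self):
        # Walk the schedule; the last size repeats until the data runs out.
        # Leftover samples that cannot fill a batch are dropped.
        remaining, step = self.num_samples, 0
        while True:
            size = self.schedule[min(step, len(self.schedule) - 1)]
            if size > remaining:
                return
            yield size
            remaining -= size
            step += 1

    def __iter__(self):
        indices = list(range(self.num_samples))
        if self.shuffle:
            self.rng.shuffle(indices)
        start = 0
        for size in self._sizes():
            yield indices[start:start + size]
            start += size

    def __len__(self):
        # Number of batches per epoch.
        return sum(1 for _ in self._sizes())


sampler = ScheduledBatchSampler(50, [10, 20], shuffle=False)
print([len(batch) for batch in sampler])  # [10, 20, 20]
```

For a stage-wise schedule like the StyleGAN one, the schedule list could be rebuilt or swapped on the sampler between stages, though I have not tested how that interacts with a particular training framework.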
You could use this code snippet to see an example.
Note that the “variable size” is usually the temporal dimension or the spatial dimensions (e.g. images with a different resolution) not the batch size.
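To illustrate that more common case, here is a small sketch of a collate_fn that pads variable-length sequences to the longest one in the batch. It uses plain Python lists so it runs on its own; `pad_collate` and the `(sequence, label)` sample layout are assumptions for the example. With tensors, `torch.nn.utils.rnn.pad_sequence` does the padding step.

```python
def pad_collate(samples, pad_value=0):
    """Pad variable-length sequences to the longest one in the batch.

    samples: list of (sequence, label) pairs, as a Dataset's __getitem__
             might return them (assumed layout for this sketch).
    """
    lengths = [len(seq) for seq, _ in samples]
    max_len = max(lengths)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq, _ in samples]
    labels = [label for _, label in samples]
    return padded, lengths, labels


batch = [([1, 2, 3], 0), ([4], 1), ([5, 6], 0)]
padded, lengths, labels = pad_collate(batch)
print(padded)   # [[1, 2, 3], [4, 0, 0], [5, 6, 0]]
print(lengths)  # [3, 1, 2]
```

Note that the number of samples in `batch` is whatever the DataLoader passed in; the function only changes the shape of each sample, not how many there are.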
I’ve come across the same issue while trying to implement this functionality of StyleGAN using PyTorch Lightning, which I believe is like your use case. Any luck on your end in resolving this issue?
I also wish there was a way to do this, e.g. make the batch size depend on the size of the input samples in the case of variable-length input data, such as sentences, etc.
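One way this could be approached is again through `batch_sampler`: build the index lists yourself so that the batch size shrinks as the samples get longer. The sketch below groups indices under a budget on the padded size (number of samples times the longest sample in the batch). `length_bucketed_batches` and `max_tokens` are names made up for this example, and it assumes the per-sample lengths are known up front.

```python
def length_bucketed_batches(lengths, max_tokens):
    """Group sample indices so each batch's padded size stays within max_tokens.

    Padded size = (number of samples in the batch) * (longest sample in it).
    A single sample longer than max_tokens still gets a batch of its own.
    """
    # Sort indices by length so similar-length samples end up together.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, current = [], []
    for idx in order:
        # Ascending order means the newest sample is the longest in the batch.
        if current and (len(current) + 1) * lengths[idx] > max_tokens:
            batches.append(current)
            current = []
        current.append(idx)
    if current:
        batches.append(current)
    return batches


lengths = [5, 50, 7, 48, 6, 100]
batches = length_bucketed_batches(lengths, max_tokens=100)
print(batches)  # [[0, 4, 2], [3, 1], [5]]
```

The resulting list can be shuffled at batch level each epoch and passed as `DataLoader(dataset, batch_sampler=batches)`, combined with a padding collate_fn.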