Batch size larger than dataset size

HaziqRazali · September 30, 2019, 7:43am

How can I configure the dataloader to accept a batch size that is larger than the dataset size? Is it possible for the dataloader to continue sampling from the dataset?

uti_va_loader = torch.utils.data.DataLoader(uti_va_data, 
                                            batch_size=args.batch_size, 
                                            shuffle=True, 
                                            num_workers=0, 
                                            drop_last=False,
                                            pin_memory=torch.cuda.is_available())

The loop below is skipped entirely if the batch size is larger than the dataset size.

for batch_idx, data in enumerate(uti_va_loader): 
    print(data.size())

JuanFMontesinos · September 30, 2019, 3:24pm

It’s not about the dataloader but the dataset. The dataloader is data-agnostic: it just iterates over a list of indices whose length is set by the dataset's __len__ method.

How do you expect the dataloader to accept a larger batch size if it cannot load nonexistent data? It will throw errors.

Anyway, if you artificially enlarge the number that dataset.__len__ returns, you will be able to.

If what you want to do is create a kind of infinite loop, you can use itertools' built-in repeat, which allows you to iterate over an iterable as many times as you want.
https://docs.python.org/2/library/itertools.html#itertools.repeat
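
A minimal sketch of that idea, using a plain list of batches as a stand-in for the DataLoader (the names batches and endless are hypothetical, not part of any library). itertools.repeat yields the same object again and again, so each pass of the outer loop iterates over it from the start:

```python
from itertools import islice, repeat

# Stand-in for a DataLoader: any re-iterable object behaves the same way here.
batches = [[0, 1], [2, 3], [4]]


def endless(loader):
    """Yield batches forever by re-iterating the loader on every pass."""
    for one_pass in repeat(loader):  # repeat() hands back the same object each time
        for batch in one_pass:       # a fresh iteration over it, i.e. one epoch
            yield batch


# Take 7 batches even though a single pass only contains 3.
first_seven = list(islice(endless(batches), 7))
print(first_seven)
```

Because the loader is re-iterated on each pass rather than cached, a loader created with shuffle=True should draw a new order every pass. itertools.cycle, by contrast, stores the items from the first pass and replays them.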

HaziqRazali · October 1, 2019, 1:38am

"Anyway if you artificially enlarge the number that dataset.__len__ returns you will be able to."

Thank you. That gave me the idea to simply take the index modulo the actual dataset length, which lets me set a batch size larger than the size of the dataset. I still needed to make __len__ return a larger number: the maximum of the dataframe length and the batch size.

Set the length of the dataset to the maximum of the actual dataset length and the batch size:

def __len__(self):
    return max(len(self.df), args.batch_size)

Take idx modulo the actual length of the data:

def __getitem__(self, idx):
    idx = idx % self.data_len

Template below

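import pandas as pd
import torch

class uti_dataset(torch.utils.data.Dataset):
    def __init__(self, args, data_path):
        # keep the batch size so __len__ does not rely on a global args
        self.batch_size = args.batch_size

        # load dataset
        self.df = pd.DataFrame()
        ...
        self.data_len = len(self.df)

    def __len__(self):
        # report at least batch_size items so a full batch can be drawn
        return max(self.data_len, self.batch_size)

    def __getitem__(self, idx):
        # wrap indices past the end back onto the real data
        idx = idx % self.data_len
        filenames = self.df.iloc[idx]["filepaths"]
        # load and transform data
        # ...
        return images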
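
For illustration, here is a self-contained sketch of the same wrap-around trick without pandas or image loading. WrappedDataset and its items and min_len arguments are hypothetical names chosen for this example; the class only implements __len__ and __getitem__, which is the interface a map-style dataset exposes, so it should behave the same way when subclassing torch.utils.data.Dataset.

```python
class WrappedDataset:
    """Report at least min_len items and wrap extra indices onto the real data."""

    def __init__(self, items, min_len):
        if not items:
            # an empty dataset would make the modulo below divide by zero
            raise ValueError("items must not be empty")
        self.items = list(items)
        self.data_len = len(self.items)
        self.min_len = min_len  # e.g. the batch size

    def __len__(self):
        # never report fewer items than one full batch needs
        return max(self.data_len, self.min_len)

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # indices past the real data wrap around to the beginning
        return self.items[idx % self.data_len]


# Three real samples, but a "batch" of eight can still be drawn.
ds = WrappedDataset(["a", "b", "c"], min_len=8)
batch = [ds[i] for i in range(len(ds))]
print(len(ds), batch)
```

Keep in mind that the extra samples are repeats of the real ones, so anything averaged over such a batch weights the repeated items more heavily.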