Variable batch sizes for training?

Hi, I have a training set that I want to divide into batches of variable sizes based on an index list (for instance, batch 1 would contain data with indices 1 to 100, batch 2 indices 101 to 129, batch 3 indices 130 to 135, and so on). I checked DataLoader, but it seems to only support fixed-size batches. What would be a good way to do this?

Thank you!

Why don’t you shuffle your data and drop the last samples?

Because I want to keep the order fixed, so that each batch contains exactly the data specified by the index list. In my example above, batch 1 should contain only the data with indices 1 to 100, not 100 random data points, and likewise for batches 2, 3, …

Do you know these lengths beforehand?
If so, you could use these indices to slice your data, set batch_size=1, and reshape each sample with view to fake the batch size:

import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self):
        self.data = torch.randn(250, 1)
        # boundaries between chunks; chunk i covers [batch_indices[i], batch_indices[i+1])
        self.batch_indices = [0, 100, 129, 150, 200, 250]

    def __getitem__(self, index):
        # return one whole chunk as a single "sample"
        start_idx = self.batch_indices[index]
        end_idx = self.batch_indices[index + 1]
        data = self.data[start_idx:end_idx]
        return data

    def __len__(self):
        # number of chunks, not number of individual samples
        return len(self.batch_indices) - 1


dataset = MyDataset()
loader = DataLoader(
    dataset,
    batch_size=1,   # each "sample" is already a full chunk
    shuffle=False,  # keep the chunk order fixed
    num_workers=2
)

for data in loader:
    # drop the fake batch dimension of size 1
    data = data.view(-1, 1)
    print(data.shape)  # e.g. torch.Size([100, 1]), torch.Size([29, 1]), ...
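
As a side note, depending on your use case you might also be able to pass the precomputed index lists directly via DataLoader's batch_sampler argument, which accepts an iterable yielding one list of indices per batch. A rough sketch (IndexDataset and the boundary values are just made up for this example):

import torch
from torch.utils.data import Dataset, DataLoader

class IndexDataset(Dataset):
    def __init__(self):
        self.data = torch.randn(250, 1)

    def __getitem__(self, index):
        # return a single sample; batching is handled by the batch_sampler
        return self.data[index]

    def __len__(self):
        return len(self.data)

# one list of indices per batch, e.g. [0..99], [100..128], ...
boundaries = [0, 100, 129, 150, 200, 250]
batches = [list(range(start, end)) for start, end in zip(boundaries, boundaries[1:])]

loader = DataLoader(IndexDataset(), batch_sampler=batches)

for data in loader:
    print(data.shape)  # torch.Size([100, 1]), torch.Size([29, 1]), ...

The upside is that each batch comes out with its real batch dimension, so no view is needed afterwards.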