Hi! I need help with DataLoader to clear up what is probably a simple misconception. Below is an MWE illustrating my use case. I pass a list of three arrays of size 1024 to a custom Dataset, whose __getitem__ returns element idx from each array as a tuple of three elements. I have checked the Dataset by iterating over it directly and confirmed that I can retrieve elements 0 through 1023. With a DataLoader and batch_size=100, I expect the DataLoader to return batches of size 100. However, I only get one batch of size 1. I thought I understood how the DataLoader was supposed to work, but obviously not. Any insight is appreciated. Thanks.
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
# Create 3 arrays of size 1024
a = np.random.randn(3, 1024)
print(a.shape)   # (3, 1024)
a1, a2, a3 = a[0], a[1], a[2]
print(a1.shape)  # (1024,)
class myDataset(Dataset):
    """
    Parameters
    ----------
    data : list of numpy arrays
    """
    def __init__(self, data):
        assert isinstance(data, list), "myDataset: argument must be of type list"
        self.data = data

    def __getitem__(self, idx):
        return tuple(data[idx] for data in self.data)

    def __len__(self):
        return len(self.data)
data = myDataset([a1, a2, a3])
data_iter = DataLoader(data, batch_size=100, shuffle=False)

for index, values in enumerate(data_iter):
    print("index= ", index)
    print("values= ", values)
# Output of the for loop:
# index=  0
# values=  [tensor([-0.4421, -0.4562,  1.2012], dtype=torch.float64),
#           tensor([-0.8228, -0.7304,  0.6380], dtype=torch.float64),
#           tensor([ 1.2241,  0.4840, -0.0031], dtype=torch.float64)]
# I expected to collect roughly 10 batches of size 100.
# Instead, the for loop runs only a single iteration and yields
# the equivalent of one batch of size 1.
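For reference, this is the batching behavior I expected. The sketch below is a minimal comparison, not my actual code: it swaps my custom class for torch.utils.data.TensorDataset built from the same three arrays (assuming TensorDataset matches my intent of indexing along the 1024-element axis). With it, the same DataLoader settings produce ten full batches plus a final partial batch:

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

a = np.random.randn(3, 1024)
a1, a2, a3 = (torch.from_numpy(x) for x in a)

# TensorDataset indexes along the first dimension of each tensor,
# so its length here is 1024, not 3.
ds = TensorDataset(a1, a2, a3)
loader = DataLoader(ds, batch_size=100, shuffle=False)

batches = list(loader)
print(len(batches))         # 11: ten batches of 100 plus one of 24
print(batches[0][0].shape)  # torch.Size([100])
```

This is the shape of output I was hoping to get from my own Dataset.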