Memory-efficient HDF5 dataset loading

I am trying to train a model on a dataset of ~11 million samples of 1D vectors stored in an HDF5 file. Everything runs fine with smaller datasets, but with the full dataset training just hangs and I cannot get through even a single epoch after several hours of running on a GPU. I am wondering if I am exceeding system memory, even though I think I am doing lazy loading. The code I am using is similar to the below:

import torch
from torch.utils.data import Dataset, DataLoader

class ConcatDataset(Dataset):
    def __init__(self, xdata, ydata):
        self.xdatasets = xdata
        self.ydatasets = ydata

    def __getitem__(self, i):
        # Index each underlying HDF5 dataset lazily, one sample at a time
        x = torch.cat([torch.tensor(d[i]) for d in self.xdatasets])
        y = torch.cat([torch.tensor(d[i]) for d in self.ydatasets])
        return (x, y)

    def __len__(self):
        return min(len(d) for d in self.xdatasets)

train_loader = DataLoader(ConcatDataset(xdata, ydata),
                          batch_size=args.batch_size, shuffle=True)
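
In case it matters, xdata and ydata are lists of h5py dataset handles, opened roughly like this (the file path and dataset names here are just placeholders, not the real ones):

import h5py

# Open the file read-only; indexing the datasets later should read
# only the requested sample, not the whole array
h5file = h5py.File("train.h5", "r")
xdata = [h5file["x_vectors"]]
ydata = [h5file["y_labels"]]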

I would appreciate some advice on whether I am missing something obvious above. I am training with a batch size of 100.