Hi,
First of all, my dataset is loaded from a pickle file, where each variable is an np array (they are velocity components). Second, they are normalized and converted to torch tensors. I'm training an SRGAN with low-res and high-res image pairs, btw. The dataset is around 14k images.
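Roughly, the preprocessing looks like this (the file name, dict keys, and min-max normalization below are just placeholders for my actual code):

import pickle
import numpy as np
import torch

# placeholder file name and keys -- my real pickle stores the velocity components
with open("velocity_fields.pkl", "rb") as f:
    data = pickle.load(f)
LR_np, HR_np = data["lr"], data["hr"]

# normalize (min-max here, as an example) and convert to float32 tensors
LR_data_train = torch.from_numpy((LR_np - LR_np.min()) / (LR_np.max() - LR_np.min())).float()
HR_data_train = torch.from_numpy((HR_np - HR_np.min()) / (HR_np.max() - HR_np.min())).float()

The Dataset and DataLoader are then built like this: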
dataset_train = torch.utils.data.TensorDataset(LR_data_train, HR_data_train)
trainloader = torch.utils.data.DataLoader(dataset_train, batch_size=8,
                                          shuffle=True, num_workers=8, pin_memory=True)
I'm training on an NVIDIA Tesla V100, and each epoch takes about 14 minutes (around 1800 batches per epoch), which seems far too slow. I believe the bottleneck is slow I/O, and was wondering if there is a workaround for this?
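To verify that, I was planning to time the data loading separately from the GPU work, something along these lines (just a sketch; the training step itself is omitted):

import time
import torch

data_time, total_time = 0.0, 0.0
end = time.time()
for lr, hr in trainloader:
    data_time += time.time() - end             # time spent waiting on the DataLoader
    lr = lr.cuda(non_blocking=True)
    hr = hr.cuda(non_blocking=True)
    # ... generator/discriminator forward and backward passes go here ...
    torch.cuda.synchronize()                   # so the wall clock includes the GPU work
    total_time += time.time() - end
    end = time.time()
print(f"waiting on data: {data_time:.1f}s out of {total_time:.1f}s")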
I believe the whole dataset is read each epoch, and I was thinking about maybe writing a custom Dataset/DataLoader, or putting all my tensors into an HDF5 file and reading from that, something like the sketch below (the "LR"/"HR" dataset names and layout are placeholders):
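import h5py
import torch
from torch.utils.data import Dataset

class H5SRDataset(Dataset):
    """Reads LR/HR pairs lazily from an HDF5 file ("LR"/"HR" names are placeholders)."""

    def __init__(self, path):
        self.path = path
        self.file = None                        # opened lazily, once per DataLoader worker
        with h5py.File(path, "r") as f:
            self.length = f["LR"].shape[0]

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self.file is None:
            self.file = h5py.File(self.path, "r")
        lr = torch.from_numpy(self.file["LR"][idx]).float()
        hr = torch.from_numpy(self.file["HR"][idx]).float()
        return lr, hr

Would something like this help, or is there a better way to remove the bottleneck?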