GPU usage decreases massively during the first epoch

Hello. I am training a fairly simple supervised model: a conv net that generates an embedding fed into an LSTM.
At the beginning of training, for around half the first epoch, I manage to do around 40 batches a second, and GPU utilization is at around 20%. However, this number slowly decreases, and after a couple of minutes GPU utilization is at around 7% and I am doing around 2-3 batches a second. It sometimes goes up again for 30 seconds or so, but inevitably comes back down. GPU memory does NOT go up during this period.
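To quantify the slowdown, a rolling batches-per-second meter can be dropped into the training loop (this is my own sketch, not part of the training code; the class and method names are made up):

```python
import time
from collections import deque

class ThroughputMeter:
    """Rolling batches-per-second over the last `window` batches."""

    def __init__(self, window=100):
        # Timestamps of the most recent `window` batch completions.
        self.times = deque(maxlen=window)

    def tick(self):
        # Call once after each batch finishes.
        self.times.append(time.perf_counter())

    def batches_per_sec(self):
        # Need at least two timestamps to measure a rate.
        if len(self.times) < 2:
            return 0.0
        elapsed = self.times[-1] - self.times[0]
        return (len(self.times) - 1) / elapsed if elapsed > 0 else 0.0
```

Calling `tick()` after every batch and printing `batches_per_sec()` every few hundred batches gives a log of exactly when the rate collapses.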

Does anyone have any idea about that?

More Info:
Since I suspect it might be due to the nature of the dataset, here is some more detail about it. I am using a custom DataLoader. Each training sample is composed of 5 elements, say (seq, array1, array2, label, image), where seq is a list of 6 PIL images of size 32x32. I first generate the dataset, and then save it into a pickle file. At train time, I load the pickle object and put each element into a list.
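The save/load step looks roughly like this (the function names and the exact tuple layout are my own sketch of what I described above):

```python
import pickle

def save_dataset(samples, path):
    # samples: list of (seq, array1, array2, label, image) tuples
    with open(path, "wb") as f:
        pickle.dump(samples, f)

def load_dataset(path):
    # Load the pickled list and split it into one parallel list per field,
    # which __getitem__ can then index into.
    with open(path, "rb") as f:
        samples = pickle.load(f)
    seqs, arrays1, arrays2, labels, images = map(list, zip(*samples))
    return seqs, arrays1, arrays2, labels, images
```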
The __getitem__ function is like this:

    def __getitem__(self, idx):
        # Return one training sample as a 5-tuple; shapes in comments.
        return (
            torch.cat([self.transform(c).unsqueeze(0) for c in self.seq[idx]]),  # (6, 3, 32, 32)
            torch.tensor(self.array1[idx]).float(),  # (6, 2)
            torch.tensor(self.array2[idx]).float(),  # (11,)
            torch.tensor(self.label[idx]).long(),    # scalar
            self.transform(self.images[idx]),        # (3, 128, 128)
        )
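Note that `__getitem__` runs `self.transform` on six PIL images plus one 128x128 image per sample, which is CPU work done per access. One thing I could try (a hypothetical helper, assuming the transforms are deterministic, i.e. no random augmentation) is precomputing the per-frame transform once at dataset construction:

```python
def precompute_seq_transforms(seqs, transform):
    """Apply `transform` to every frame once, up front, so __getitem__
    only has to index into already-transformed data.

    seqs: list of sequences, each a list of frames (e.g. PIL images).
    """
    return [[transform(frame) for frame in seq] for seq in seqs]
```

The alternative, which keeps random augmentations possible, is to let the DataLoader parallelize this work across worker processes via its `num_workers` argument.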

More Info:
Maybe this is not directly related to the training setup. If I stop training and restart it straight away, the GPU is slow EVEN at the very beginning. Maybe the GPU is overheating? But nvidia-smi shows a temperature of only 53 C, which seems acceptable…

This indeed sounds as if your setup is running into a thermal issue and is downclocking the hardware. Try to measure the clocks of the GPU, CPU, etc. (e.g. `nvidia-smi --query-gpu=clocks.sm,clocks.mem,temperature.gpu --format=csv -l 1` for the GPU) and see if these decrease over time.


Thanks, I’ll try that. Just to make sure it isn’t something in my code, I tried on a different machine, and there is no slowdown.