I have a model with multiple outputs and, therefore, multiple losses. During training I accumulate the gradients of all the losses by calling backward() on each one with retain_graph=True. Something along the lines of:
    self.zero_grad()
    for output_label, output in self(input, target).items():
        loss = self.loss(output, target[output_label])
        loss.backward(retain_graph=True)
    self.optimizer.step()
where input, output and target are dictionaries with the respective data for the different inputs and losses.
I am using Adam for the optimization.
I’ve noticed that after a number of epochs, the running time of an epoch suddenly jumps from 7 s to 34 s.
I also noticed a drop in CPU usage on my computer (I haven’t tested this on the GPU yet). Memory usage doesn’t seem to increase.
I profiled the code and I saw this (output from cProfile):
Normal epoch:
    ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
        52    0.012    0.000    3.538    0.068  adam.py:30(step)
       624    0.646    0.001    0.646    0.001  {method 'addcdiv_' of 'torch._C.FloatTensorBase' objects}
Slow epoch:
    ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
        52    0.013    0.000   24.576    0.473  adam.py:30(step)
       624   21.469    0.034   21.469    0.034  {method 'addcdiv_' of 'torch._C.FloatTensorBase' objects}
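Per-epoch numbers like these can be collected with something along the lines of the sketch below. It uses only the standard library; train_one_epoch is a hypothetical stand-in for the real epoch loop, and profile_epoch is a helper name made up for this example.

```python
import cProfile
import io
import pstats


def train_one_epoch():
    # Hypothetical stand-in for the real epoch loop; it only burns some CPU.
    return sum(i * i for i in range(100_000))


def profile_epoch(fn, top=10):
    """Run fn under cProfile and return the `top` entries sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    fn()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


report = profile_epoch(train_one_epoch)
print(report)
```

Calling profile_epoch once per epoch and comparing the reports makes it easy to see which call (here, addcdiv_) accounts for the extra time.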
I’ve tested with other adaptive optimizers like Adagrad, and the issue doesn’t appear there.
It seems to be related to this line of code in Adam.step():
p.data.addcdiv_(-step_size, exp_avg, denom)
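For context, here is a plain-Python sketch of the update that line belongs to, written for a single scalar parameter. It follows the standard Adam formulas as I understand them from that version of adam.py, so treat the details as an assumption; adam_step and its argument names are made up for this example.

```python
import math


def adam_step(p, grad, exp_avg, exp_avg_sq, step,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; returns the new state."""
    # Exponential moving averages of the gradient and of its square.
    exp_avg = beta1 * exp_avg + (1 - beta1) * grad
    exp_avg_sq = beta2 * exp_avg_sq + (1 - beta2) * grad * grad
    denom = math.sqrt(exp_avg_sq) + eps

    # Bias corrections for the zero-initialised moving averages.
    bias_correction1 = 1 - beta1 ** step
    bias_correction2 = 1 - beta2 ** step
    step_size = lr * math.sqrt(bias_correction2) / bias_correction1

    # Scalar equivalent of p.data.addcdiv_(-step_size, exp_avg, denom).
    p = p - step_size * exp_avg / denom
    return p, exp_avg, exp_avg_sq


# First step from zero state: the parameter moves by roughly lr.
p, m, v = adam_step(p=1.0, grad=0.5, exp_avg=0.0, exp_avg_sq=0.0, step=1)
print(p, m, v)
```

The addcdiv_ call is just this elementwise multiply, divide and add, so its cost should depend only on the tensor sizes and the values in exp_avg and denom, not on how many backward passes were accumulated.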
Any ideas about why this is happening? It seems like suddenly the size of the accumulated gradient explodes, but I can’t see why.