I use try-catch to enclose the forward and backward functions, and I also delete all the tensors after every batch.
try:
decoder_output, loss = model(batch)
if Q.qsize() > 0:
break
optimizer.zero_grad()
loss.backward()
optimizer.step()
train_loss.append(loss.mean().item())
del decoder_output, loss
except Exception as e:
optimizer.zero_grad()
for p in model.parameters():
if p.grad is not None:
del p.grad
torch.cuda.empty_cache()
oom += 1
del batch
for p in model.parameters():
if p.grad is not None:
del p.grad
torch.cuda.empty_cache()
The training process is normal at the first thousands of steps, even if it got OOM exception, the exception will be catched and the GPU memory will be released.
But after I trained thousands of batches, it suddenly keeps getting OOM for every batch and the memory seems never be released anymore.
It’s so weird to me, is there any suggestions? (I’m using distributed data parallel)