Recently I needed to double backpropagate on the gradients of the embedding layer for NLP tasks. I essentially have two ways of doing it: one with autograd.grad, the other with register_backward_hook on the embedding layer.
But here is the thing: if I just use autograd.grad to get the gradient with respect to the embedding layer's weight, let's call it W', then I can call backward on W' without any memory leak. The problem is that this is not exactly what I want: if the same token appears multiple times in the sentence, the gradient returned for that token is the sum of its gradients over all of those occurrences.
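To make the summing issue concrete, here is a minimal sketch of that approach on a toy embedding (the sizes and token indices are made up purely for illustration):

import torch

emb = torch.nn.Embedding(10, 5)
net = torch.nn.Linear(5, 1)
ix = torch.tensor([3, 3, 7])  # token 3 appears twice

loss = net(emb(ix)).norm(2)
# W_prime has shape (10, 5); its row for token 3 is already the SUM of the
# gradients from both occurrences, which is not what I want.
W_prime, = torch.autograd.grad(loss, emb.weight, create_graph=True)
penalty = W_prime.norm(2)
penalty.backward()  # double backward works here, and there is no memory leak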
So the only way to do it that I know of, in this case, is to backward on the gradients captured by a hook. This is achieved by registering a hook on the embedding layer with register_backward_hook, and then calling either .backward(create_graph=True) or autograd.grad(loss, embedding_layer.weight, create_graph=True). The gradients are then captured by the hook, and they are actually separated per occurrence rather than summed up for repeated tokens. However, this incurs a GPU memory leak.
Here is a minimized piece of code that reproduces it:
import torch
import os
import numpy as np
np.random.seed(0)
torch.manual_seed(0)
os.environ['CUDA_VISIBLE_DEVICES']="1"
dev = torch.device('cuda')
embedding_gradient = []
def hook_layer(module, grad_in, grad_out):
embedding_gradient.append(grad_out[0])
arr = np.random.randint(10, size=3)
random_ix = torch.LongTensor(arr).to(dev)
print(arr)
embedding_layer = torch.nn.Embedding(10, 5, padding_idx=0).to(dev)
net = torch.nn.Linear(5,1).to(dev)
for i in range(1000):
    print(i)
    print(torch.cuda.memory_summary(device=0, abbreviated=True))
    #1 set the hook to embedding layer
    embedding_gradient = []
    hook = embedding_layer.register_backward_hook(hook_layer)
    #2 forward pass
    embeds = embedding_layer(random_ix)
    out = net(embeds)
    #3 backward pass
    summed = out.norm(2)
    summed.backward(create_graph=True)
    #4 remove the hook
    hook.remove()
    final = embedding_gradient[0].sum()
    final.backward()
If I use grad_auto = torch.autograd.grad(summed, embedding_layer.weight, create_graph=True) instead of summed.backward(create_graph=True), the memory leak goes away.
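For reference, the variant of the loop body that does not leak in the minimal example above looks roughly like this (same setup and hook as before):

embedding_gradient = []
hook = embedding_layer.register_backward_hook(hook_layer)
embeds = embedding_layer(random_ix)
out = net(embeds)
summed = out.norm(2)
# take the gradient explicitly instead of calling backward(create_graph=True);
# the hook still fires and captures the per-occurrence gradient of the output
grad_auto = torch.autograd.grad(summed, embedding_layer.weight, create_graph=True)
hook.remove()
final = embedding_gradient[0].sum()
final.backward()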
But that is not the case in my actual code: there, I get a memory leak either way. It would be great if anyone knows how to solve this issue or how to circumvent it. I am using PyTorch 1.5 with CUDA 10.0 on a GTX 1080 Ti GPU; the leak is also reproducible on 1.4. Thanks in advance!