Gradient is None for some Parameters

For adversarial training, I want to perturb word embeddings. Oversimplified, the problematic part of my code looks like this:

optimizer.zero_grad()
emb = embedder(token_ids)
emb = Variable(emb, requires_grad=True)
logits = cls_head(emb)
loss = loss_func(logits, target) 
loss.backward()
optimizer.step()
print(embedder.weight.grad)  # >>> None
print(cls_head.weight.grad)  # >>> tensor(...)

The gradient for cls_head is calculated, but the gradient for embedder is None. Why is the gradient of embedder None, and how can I calculate the gradient w.r.t. emb without running into this problem?

Kind Regards,
Milan

You are detaching emb from the computation graph by re-wrapping it into the deprecated Variable:

emb = Variable(emb, requires_grad=True)

Remove this line of code to keep the computation graph alive.
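
With that line removed, a minimal self-contained sketch (the sizes and the optimizer here are made up; plug in your real setup) would look like this:

import torch
import torch.nn as nn

# toy stand-ins for your modules (sizes are made up)
embedder = nn.Embedding(10, 4)
cls_head = nn.Linear(4, 2)
loss_func = nn.MSELoss()
optimizer = torch.optim.SGD(
    list(embedder.parameters()) + list(cls_head.parameters()), lr=0.1
)
token_ids = torch.randint(0, 10, (5,))
target = torch.randn(5, 2)

optimizer.zero_grad()
emb = embedder(token_ids)        # no Variable re-wrap: the graph stays intact
logits = cls_head(emb)
loss = loss_func(logits, target)
loss.backward()
optimizer.step()

print(embedder.weight.grad)      # now a valid tensor instead of None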

Is there any other way to get the gradient with respect to emb?
I need the gradient for both emb and the parameters within the embedder (for some perturbation).

I don’t quite understand the question. In your current code you are implicitly detaching the computation graph by recreating the tensor. If you remove this line of code, it should work without any other changes.

Since you mentioned this, I created a short simulation:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 1))
emb = model(torch.tensor([1.0, 2.0]))
emb = torch.autograd.Variable(emb, requires_grad=True)  # detaches emb from the graph
loss_fn = nn.MSELoss()
loss = loss_fn(emb, torch.tensor([1.0]))
loss.backward()
print(emb.grad)
print(model[0].weight.grad)

gives

tensor([0.0333])
None

and according to this, emb should have its grad attribute populated. But, as pointed out by @ptrblck, you explicitly detach emb from the graph, and hence the model (embedder) parameters’ grad won’t be populated.
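
For comparison, running the same toy setup with the Variable line removed shows the model’s gradients being populated:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 1))
emb = model(torch.tensor([1.0, 2.0]))    # no Variable re-wrap
loss = nn.MSELoss()(emb, torch.tensor([1.0]))
loss.backward()
print(model[0].weight.grad)              # now populated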

My goal is to transfer the “fast gradient sign method” (FGSM) to NLP. FGSM calculates the gradient w.r.t. the input images and later uses it in an adversarial update step.

Calculating the gradient w.r.t. a sequence of (integer-valued) token IDs doesn’t work, so my idea was to instead calculate the gradient w.r.t. the token embeddings, so that the embeddings can later be perturbed in the adversarial update step.

That’s why I need to calculate the gradients w.r.t. both the embedder weights (for updating them) and the embeddings (for the perturbation). Is there any way to achieve this?
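
Concretely, the FGSM step I want to transfer would look roughly like this (the function name and epsilon are just illustrative):

import torch

def fgsm_perturb(emb: torch.Tensor, grad: torch.Tensor, epsilon: float = 0.01) -> torch.Tensor:
    # FGSM: step each embedding entry by epsilon in the direction
    # (the sign of the gradient) that increases the loss
    return emb + epsilon * grad.sign()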

Is removing the line of code (mentioned a few times already) that recreates the tensor not working? If not, why?

But then emb doesn’t require grad, i.e. I can’t calculate the gradient with respect to emb.

That’s not the case, unless you have already detached the computation graph beforehand or disabled gradient calculation through a context manager such as torch.no_grad().
This code works properly:

import torch
import torch.nn as nn

embedder = nn.Embedding(10, 10)
cls_head = nn.Linear(10, 10)
token_ids = torch.randint(0, 10, (10,))

loss_func = nn.MSELoss()

emb = embedder(token_ids)
logits = cls_head(emb)

loss = loss_func(logits, torch.randn_like(logits)) 
loss.backward()

print(embedder.weight.grad)
print(cls_head.weight.grad)

as it’s not detaching the computation graph and prints valid gradients for both modules.

But emb.grad doesn’t exist then

emb is an intermediate forward activation, not a module, so it doesn’t have a weight attribute; and since it is a non-leaf tensor, its .grad attribute is not populated by default.
If you want to check the gradient of this activation, call .retain_grad() on it:

import torch
import torch.nn as nn

embedder = nn.Embedding(10, 10)
cls_head = nn.Linear(10, 10)
token_ids = torch.randint(0, 10, (10,))

loss_func = nn.MSELoss()

emb = embedder(token_ids)
emb.retain_grad()
logits = cls_head(emb)

loss = loss_func(logits, torch.randn_like(logits)) 
loss.backward()

print(embedder.weight.grad)
print(cls_head.weight.grad)
print(emb.grad)
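
Putting this together with your FGSM use case, one possible sketch (epsilon is a made-up hyperparameter, and this is just one way to wire it up, not an official recipe) would be:

import torch
import torch.nn as nn

embedder = nn.Embedding(10, 10)
cls_head = nn.Linear(10, 10)
token_ids = torch.randint(0, 10, (10,))
loss_func = nn.MSELoss()
target = torch.randn(10, 10)
epsilon = 0.01

# clean pass: populates embedder.weight.grad, cls_head.weight.grad, and emb.grad
emb = embedder(token_ids)
emb.retain_grad()
loss = loss_func(cls_head(emb), target)
loss.backward()

# adversarial pass: perturb the embeddings with the sign of their gradient;
# detach so this pass doesn't backpropagate through the (already freed) first graph
emb_adv = (emb + epsilon * emb.grad.sign()).detach()
adv_loss = loss_func(cls_head(emb_adv), target)
adv_loss.backward()   # accumulates additional gradients into cls_head only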