In PyTorch 0.3 we had Variable, and when training we needed to call Variable(input). I thought this made input.requires_grad become True, so my assumption was that input.requires_grad should always be True for training. Is that correct?
But now I'm reading "Training a Classifier" on the PyTorch website, and I see that input.requires_grad is not set to True at the beginning, but it eventually becomes True after the input is sent through the network.
So am I misunderstanding something?
Wrapping a tensor into a Variable didn't change the requires_grad attribute to True. You had to specify it while creating the Variable:
x = Variable(torch.randn(1), requires_grad=True)
Usually you don’t need gradients in your input. However, gradients in the input might be needed for some special use cases e.g. creating adversarial samples.
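As a rough illustration of the point above, here is a minimal sketch using the current tensor API (no Variable). The model, shapes, labels, and the epsilon value are made up for the example, and the sign-of-gradient step is just one common way to build an adversarial sample, not something taken from this thread.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

model = nn.Linear(10, 3)              # hypothetical stand-in model
x = torch.randn(4, 10)                # plain input tensor
target = torch.tensor([0, 1, 2, 0])   # hypothetical labels

# The input itself does not require gradients by default.
print(x.requires_grad)
# The model output does, because the model parameters require gradients.
print(model(x).requires_grad)

# Opt in to input gradients only when they are actually needed.
x_adv = x.clone().requires_grad_(True)
loss = F.cross_entropy(model(x_adv), target)
loss.backward()

# Illustrative sign-of-gradient perturbation (epsilon is an assumed value).
epsilon = 0.1
perturbed = (x_adv + epsilon * x_adv.grad.sign()).detach()
print(perturbed.shape)
```

For ordinary training, leaving the input at its default requires_grad=False is enough, since only the parameters need gradients.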
Got it, thank you very much for the clarification
Speaking of adversarial samples, I'm having issues with generating gradients w.r.t. inputs. I'm currently adapting an existing virtual adversarial training repository (https://github.com/naoto0804/pytorch-VAT) to fit my needs. I want to compute the gradient of the output of a model w.r.t. the input of the model. Unfortunately, when computing the gradient w.r.t. the input, I get "None". The problem occurs in the VATLoss module shown below. The argument x would be the input.
class VATLoss(nn.Module):
    def __init__(self, xi=10.0, eps=1.0, ip=1):
        """VAT loss
        :param xi: hyperparameter of VAT (default: 10.0)
        :param eps: hyperparameter of VAT (default: 1.0)
        :param ip: iteration times of computing adv noise (default: 1)
        """
        super(VATLoss, self).__init__()
        self.xi = xi
        self.eps = eps
        self.ip = ip

    def forward(self, model, x):
        with torch.autograd.set_grad_enabled(False):
            pred = F.softmax(model(x), dim=1)
        # prepare random unit tensor
        d = torch.rand(x.shape, device='cuda:0').sub(0.5)
        d_ = _l2_normalize(d)
        with _disable_tracking_bn_stats(model):
            # calc adversarial direction
            for _ in range(self.ip):
                d.requires_grad = True
                pred_hat = model(x + self.xi * d_)
                logp_hat = F.log_softmax(pred_hat, dim=1)
                adv_distance = F.kl_div(logp_hat, pred, reduction='batchmean')
                adv_distance.backward()
                d = _l2_normalize(d.grad)
                model.zero_grad()
            # calc LDS
            r_adv = d * self.eps
            pred_hat = model(x + r_adv)
            logp_hat = F.log_softmax(pred_hat, dim=1)
            lds = F.kl_div(logp_hat, pred, reduction='batchmean')
        return lds
Unfortunately, d.grad returns None. When calling this function, I pass nn.Parameter(unlabeled_sample, requires_grad=True) as the argument x. I tried using d_grad = torch.autograd.grad(adv_distance, d) instead of adv_distance.backward() to get the gradient with respect to d, but it still returns None. Does this lack of gradient computation for d have to do with the fact that it's not linked to any optimizer? Also, when debugging, I saw that right before adv_distance.backward(), the attributes of d are is_leaf = True, requires_grad = True and _version = 2. Any ideas?
I added _l2_normalize(d) to show what's inside.
def _l2_normalize(d):
    d_reshaped = d.view(d.shape[0], -1, *(1 for _ in range(d.dim() - 2)))
    d /= torch.norm(d_reshaped, dim=1, keepdim=True) + 1e-8
    return d
I’m using PyTorch 1.0.1. Help!
JP
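One detail of the helper as posted is that `d /= ...` divides in place and then returns the same tensor, so `d_ = _l2_normalize(d)` leaves d_ and d pointing at one object. The sketch below only illustrates that aliasing next to an out-of-place variant; the function names and shapes are hypothetical, and it is not presented as the cause of the None gradient.

```python
import torch

def l2_normalize_inplace(d):
    # Same logic as the posted helper: the division modifies d in place.
    d_reshaped = d.view(d.shape[0], -1, *(1 for _ in range(d.dim() - 2)))
    d /= torch.norm(d_reshaped, dim=1, keepdim=True) + 1e-8
    return d

def l2_normalize(d):
    # Out-of-place variant (hypothetical name): returns a new tensor.
    flat = d.reshape(d.shape[0], -1)
    norm = flat.norm(dim=1).view(-1, *([1] * (d.dim() - 1)))
    return d / (norm + 1e-8)

d = torch.rand(2, 3, 4, 4).sub(0.5)
d_alias = l2_normalize_inplace(d)
print(d_alias is d)    # the returned tensor is the argument itself

e = torch.rand(2, 3, 4, 4).sub(0.5)
e_unit = l2_normalize(e)
print(e_unit is e)     # a separate tensor; e is left untouched
print(e_unit.reshape(2, -1).norm(dim=1))  # per-sample norms, close to 1
```

With the out-of-place version, the raw random tensor and its normalized copy stay distinct, which can make it easier to see which tensor actually enters the graph.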
Where do you print d.grad?
If I just add a print statement after adv_distance.backward(), I'll get a valid gradient for d.
Also, if I register d as an attribute, I can print the gradient successfully.
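The check described above can be reproduced in isolation. This is a minimal CPU sketch, assuming a toy nn.Linear model and using 10.0 for xi (the default in the posted class); it is not the full VATLoss, and no optimizer is involved at any point.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

model = nn.Linear(10, 10)   # toy stand-in model
x = torch.randn(1, 10)

# Reference prediction, computed without tracking gradients.
with torch.no_grad():
    pred = F.softmax(model(x), dim=1)

# Random perturbation; a leaf tensor that we mark as requiring gradients.
d = torch.rand(x.shape).sub(0.5)
d.requires_grad_(True)

logp_hat = F.log_softmax(model(x + 10.0 * d), dim=1)
adv_distance = F.kl_div(logp_hat, pred, reduction='batchmean')
adv_distance.backward()

# d was used directly in the graph, so its .grad is populated after backward().
print(d.is_leaf, d.grad is not None)
print(d.grad.shape)
```

The gradient lands on whichever leaf tensor was actually used in the forward pass, so it is worth confirming that the tensor being inspected is the same object that was passed to the model.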
I didn't print d.grad; I was checking the attribute through the PyCharm debugger. When you say "register d as an attribute", do you mean creating self.d = d in __init__(...)?
I just registered it in the forward method:
    def forward(self, model, x):
        with torch.autograd.set_grad_enabled(False):
            pred = F.softmax(model(x), dim=1)
        # prepare random unit tensor
        d = torch.rand(x.shape).sub(0.5)
        d_ = _l2_normalize(d)
        self.d_dummy = d
        ....

criterion = VATLoss()
model = nn.Linear(10, 10)
x = torch.randn(1, 10)
loss = criterion(model, x)
loss.backward()
print(criterion.d_dummy.grad)
> tensor([[ 0.1734, -0.1360, -0.5533, -0.0225,  0.5100, -0.3970,  0.3045, -0.3553,
          -0.0289, -0.0856]])
I implemented your modifications, and when I call adv_distance.backward(...) inside the forward(...) function of VATLoss, I still get None:
[...]
    def forward(self, model, x):
        with torch.autograd.set_grad_enabled(False):
            pred = F.softmax(model(x), dim=1)
        # prepare random unit tensor
        d = torch.rand(x.shape, device='cuda:0').sub(0.5)
        d_ = _l2_normalize(d)
        with _disable_tracking_bn_stats(model):
            # calc adversarial direction
            for _ in range(self.ip):
                self.dummy_d = d_
                self.dummy_d.requires_grad = True
                pred_hat = model(x + self.xi * self.dummy_d)
                logp_hat = F.log_softmax(pred_hat, dim=1)
                adv_distance = F.kl_div(logp_hat, pred, reduction='batchmean')
                adv_distance.backward(retain_graph = True)
                print(self.dummy_d.grad)
                d_ = _l2_normalize(self.dummy_d.grad)
                model.zero_grad()
[...]
The print(…) returns None. The code upstream looks like:
[...]
model.train()
optimizer.zero_grad()
unlabeled_sample = nn.Parameter(unlabeled_sample, requires_grad=True)
VATLoss_ = VATLoss()
vat_loss_ = VATLoss_(model, unlabeled_sample) / len(unlabeled_sample)
[...]
Should I call backward() outside of the forward() of VATLoss instead?
I'm not sure what your code is actually doing, but I get valid gradients even inside forward:
...
adv_distance.backward()
print(d.grad)
print(self.d_dummy.grad)
...
> tensor([[ 1.7061, -0.8534,  0.1437,  0.7026,  0.3845,  0.9892,  1.0329, -1.8058,
          -2.0117, -0.1880]])
tensor([[ 1.7061, -0.8534,  0.1437,  0.7026,  0.3845,  0.9892,  1.0329, -1.8058,
         -2.0117, -0.1880]])
It looks like you are trying to create some adversarial sample inside the forward of VATLoss, but apparently you are recreating it in each iteration from scratch?
I think it might be better to create a new topic and explain your use case a bit more so that others might have a look at this issue.