Do I need requires_grad=True for the input when switching from PyTorch 0.3 to 1.0?

isalirezag · March 11, 2019, 8:58pm

In PyTorch 0.3 we had Variable, and for training we needed to wrap the input as Variable(input).
As I understood it, this made input.requires_grad True,
so my assumption was that input.requires_grad should always be True for training. Is that correct?
But now I'm reading the ‘Training a Classifier’ tutorial on the PyTorch website, and I see that input.requires_grad is not set to True at the beginning, yet it eventually becomes True after the input is sent through the network.
So am I misunderstanding something?

ptrblck · March 11, 2019, 9:48pm

Wrapping a tensor into Variable didn’t change the requires_grad attribute to True.
You had to specify it while creating the Variable:

x = Variable(torch.randn(1), requires_grad=True)

Usually you don’t need gradients for your input. However, gradients w.r.t. the input might be needed for some special use cases, e.g. creating adversarial samples.
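
A minimal sketch of both points, assuming a toy nn.Linear model and random data (both hypothetical, not taken from the tutorial): during ordinary training the input keeps requires_grad=False, while the output requires gradients because the model parameters do. Gradients w.r.t. the input only show up if you ask for them explicitly, as in the FGSM-style step at the end, where epsilon is an arbitrary illustrative value.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Linear(10, 3)              # toy model standing in for a real classifier
x = torch.randn(4, 10)                # plain tensor, requires_grad defaults to False
target = torch.tensor([0, 1, 2, 0])

out = model(x)
print(x.requires_grad)                # False: the input itself is unchanged
print(out.requires_grad)              # True: the output depends on the parameters

# Ordinary training step: only the parameters receive gradients.
F.cross_entropy(out, target).backward()
print(x.grad)                         # None
print(model.weight.grad is not None)  # True

# Special case: request the gradient w.r.t. the input (e.g. adversarial samples).
x_adv = x.clone().requires_grad_(True)
loss = F.cross_entropy(model(x_adv), target)
loss.backward()
epsilon = 0.1                         # illustrative step size (assumption)
x_perturbed = (x_adv + epsilon * x_adv.grad.sign()).detach()
```

So the attribute that reads True after the forward pass belongs to the output tensor, not to the input that was passed in.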

isalirezag · March 11, 2019, 10:00pm

Got it, thank you very much for the clarification

jp_letendre · March 11, 2019, 10:37pm

Speaking of adversarial samples, I’m having issues computing gradients w.r.t. inputs. I’m currently adapting an existing virtual adversarial training repository (https://github.com/naoto0804/pytorch-VAT) to fit my needs. I want to compute the gradient of the model’s output w.r.t. the model’s input. Unfortunately, when I compute this gradient, I get None. The problem occurs in the VATLoss module below, where the argument x is the input.

class VATLoss(nn.Module):

    def __init__(self, xi=10.0, eps=1.0, ip=1):
        """VAT loss
        :param xi: hyperparameter of VAT (default: 10.0)
        :param eps: hyperparameter of VAT (default: 1.0)
        :param ip: iteration times of computing adv noise (default: 1)
        """
        super(VATLoss, self).__init__()
        self.xi = xi
        self.eps = eps
        self.ip = ip

    def forward(self, model, x):
        with torch.autograd.set_grad_enabled(False):
            pred = F.softmax(model(x), dim=1)

        # prepare random unit tensor
        d = torch.rand(x.shape, device='cuda:0').sub(0.5)
        d_ = _l2_normalize(d)

        with _disable_tracking_bn_stats(model):
            # calc adversarial direction
            for _ in range(self.ip):
                d.requires_grad = True
                pred_hat = model(x + self.xi * d_)
                logp_hat = F.log_softmax(pred_hat, dim=1)
                adv_distance = F.kl_div(logp_hat, pred, reduction='batchmean')
                adv_distance.backward()
                d = _l2_normalize(d.grad)
                model.zero_grad()

                # calc LDS
                r_adv = d * self.eps
                pred_hat = model(x + r_adv)
                logp_hat = F.log_softmax(pred_hat, dim=1)
                lds = F.kl_div(logp_hat, pred, reduction='batchmean')

        return lds

Unfortunately, d.grad returns None. When calling this function, I pass nn.Parameter(unlabeled_sample, requires_grad=True) as the argument x. I also tried d_grad = torch.autograd.grad(adv_distance, d) instead of adv_distance.backward() to get the gradient with respect to d, but it still returns None. Is the gradient for d missing because d is not linked to any optimizer? Also, when debugging, I saw that right before adv_distance.backward(), d has is_leaf = True, requires_grad = True and _version = 2. Any ideas?

I added _l2_normalize(d) to show what’s inside.

def _l2_normalize(d):
    d_reshaped = d.view(d.shape[0], -1, *(1 for _ in range(d.dim() - 2)))
    d /= torch.norm(d_reshaped, dim=1, keepdim=True) + 1e-8
    return d

I’m using PyTorch 1.0.1. Help!

JP
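
For reference, a small CPU-only sketch of two things the snippets above rely on, using a hypothetical nn.Linear model and a stand-in loss rather than the real VAT setup. First, _l2_normalize as posted divides in place and returns the same tensor object, so d_ and d are aliases of each other. Second, autograd can return the gradient w.r.t. a leaf tensor that is not registered with any optimizer.

```python
import torch
import torch.nn as nn

def _l2_normalize(d):
    # Same logic as the posted helper: normalizes each sample in place.
    d_reshaped = d.view(d.shape[0], -1, *(1 for _ in range(d.dim() - 2)))
    d /= torch.norm(d_reshaped, dim=1, keepdim=True) + 1e-8
    return d

torch.manual_seed(0)
model = nn.Linear(10, 10)             # hypothetical toy model
x = torch.randn(2, 10)

d = torch.rand(x.shape).sub(0.5)
d_ = _l2_normalize(d)
print(d_ is d)                        # True: the helper returns its argument

d.requires_grad_(True)                # d is a leaf; no optimizer is involved
loss = model(x + 10.0 * d).pow(2).mean()   # stand-in for the KL term
(d_grad,) = torch.autograd.grad(loss, d)
print(d_grad.shape)                   # torch.Size([2, 10])
```

Optimizers only read .grad to update parameters; they play no part in whether autograd computes a gradient for a given tensor.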

ptrblck · March 11, 2019, 11:06pm

Where do you print d.grad?
If I just add a print statement after adv_distance.backward(), I’ll get a valid gradient for d.
Also, if I register d as an attribute, I can print the gradient successfully.

jp_letendre · March 11, 2019, 11:21pm

I didn’t print d.grad; I was checking the attribute in the PyCharm debugger. When you say “register d as an attribute”, do you mean setting self.d = d in __init__(...)?

ptrblck · March 11, 2019, 11:22pm

I just registered it in the forward method:

    def forward(self, model, x):
        with torch.autograd.set_grad_enabled(False):
            pred = F.softmax(model(x), dim=1)

        # prepare random unit tensor
        d = torch.rand(x.shape).sub(0.5)
        d_ = _l2_normalize(d)

        self.d_dummy = d
        ....

criterion = VATLoss()
model = nn.Linear(10, 10)
x = torch.randn(1, 10)
loss = criterion(model, x)
loss.backward()

print(criterion.d_dummy.grad)
>tensor([[ 0.1734, -0.1360, -0.5533, -0.0225,  0.5100, -0.3970,  0.3045, -0.3553,
         -0.0289, -0.0856]])

jp_letendre · March 11, 2019, 11:40pm

I implemented your modifications, and when I call adv_distance.backward(...) inside of the forward(...) function of VATLoss I still get None:

[...]
    def forward(self, model, x):
        with torch.autograd.set_grad_enabled(False):
            pred = F.softmax(model(x), dim=1)

        # prepare random unit tensor
        d = torch.rand(x.shape, device='cuda:0').sub(0.5)
        d_ = _l2_normalize(d)

        with _disable_tracking_bn_stats(model):
            # calc adversarial direction
            for _ in range(self.ip):
                self.dummy_d = d_
                self.dummy_d.requires_grad = True
                pred_hat = model(x + self.xi * self.dummy_d)
                logp_hat = F.log_softmax(pred_hat, dim=1)
                adv_distance = F.kl_div(logp_hat, pred, reduction='batchmean')
                adv_distance.backward(retain_graph = True)
                print(self.dummy_d.grad)
                d_ = _l2_normalize(self.dummy_d.grad)
                model.zero_grad()
[...]

The print(…) call outputs None. The calling code looks like this:

[...]
    model.train()
    optimizer.zero_grad()

    unlabeled_sample = nn.Parameter(unlabeled_sample, requires_grad=True)

    VATLoss_ = VATLoss()
    vat_loss_ = VATLoss_(model, unlabeled_sample) / len(unlabeled_sample)
[...]

Should I call the backward() outside of the forward() of VATLoss instead?

ptrblck · March 11, 2019, 11:49pm

I’m not sure what your code is actually doing, but I get valid gradients even inside forward:

...
adv_distance.backward()
print(d.grad)
print(self.d_dummy.grad)
...
> tensor([[ 1.7061, -0.8534,  0.1437,  0.7026,  0.3845,  0.9892,  1.0329, -1.8058,
         -2.0117, -0.1880]])
tensor([[ 1.7061, -0.8534,  0.1437,  0.7026,  0.3845,  0.9892,  1.0329, -1.8058,
         -2.0117, -0.1880]])

It looks like you are trying to create some adversarial sample inside the forward of VATLoss, but apparently you are recreating it in each iteration from scratch?
I think it might be better to create a new topic and explain your use case a bit more so that others might have a look at this issue.
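
For anyone landing here with the same question, a self-contained sketch of the power-iteration step discussed in this thread, under a few assumptions: vat_loss is a hypothetical free function rather than the VATLoss module, the batch-norm handling (_disable_tracking_bn_stats) is left out, the perturbation is created on the same device as x instead of a hard-coded GPU, and normalization is done out of place so that no tensor is aliased. It is not the reference implementation from the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def l2_normalize(d):
    # Out-of-place per-sample L2 normalization; returns a new tensor.
    norm = d.flatten(1).norm(dim=1).view(-1, *([1] * (d.dim() - 1)))
    return d / (norm + 1e-8)

def vat_loss(model, x, xi=10.0, eps=1.0, ip=1):
    # Reference prediction, treated as a constant target.
    with torch.no_grad():
        pred = F.softmax(model(x), dim=1)

    # Random starting direction with the same shape and device as the input.
    d = l2_normalize(torch.rand_like(x) - 0.5)

    for _ in range(ip):
        d.requires_grad_(True)
        logp_hat = F.log_softmax(model(x + xi * d), dim=1)
        adv_distance = F.kl_div(logp_hat, pred, reduction='batchmean')
        # Gradient w.r.t. d only; model parameters get no .grad here.
        (grad,) = torch.autograd.grad(adv_distance, d)
        d = l2_normalize(grad)

    # Smoothness term evaluated at the adversarial perturbation.
    logp_hat = F.log_softmax(model(x + eps * d), dim=1)
    return F.kl_div(logp_hat, pred, reduction='batchmean')

torch.manual_seed(0)
model = nn.Linear(10, 10)             # toy model for illustration
x = torch.randn(4, 10)
loss = vat_loss(model, x)
loss.backward()                       # gradients reach the model parameters
```

Using torch.autograd.grad for the inner step means nothing is accumulated into the parameters' .grad, so this variant does not need a model.zero_grad() call inside the loop.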