How to get the gradient of the loss function twice

Here is what I'm trying to implement:

We calculate the loss based on F(X), as usual. We also define an "adversarial loss", which is a loss based on F(X + e), where e is dF(X)/dX multiplied by some constant. Both the regular loss and the adversarial loss are backpropagated for the total loss.
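To pin down the objective, here is a tiny pure-Python toy where the gradient can be written by hand (F(x) = x**2, the squared-error loss, and epsilon are all made up for illustration):

epsilon = 0.5
x, y = 3.0, 4.0
loss = (x**2 - y)**2                # loss on F(x), with F(x) = x**2
dloss_dx = 2 * (x**2 - y) * 2 * x   # derivative of the loss w.r.t. x, by hand
e = epsilon * dloss_dx              # perturbation, treated as a constant
loss_adv = ((x + e)**2 - y)**2      # loss on F(x + e)
loss_total = loss + loss_adv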

In TensorFlow, this part (getting dF(X)/dX) can be written like this:

grad, = tf.gradients(loss, X)    # gradient of the loss w.r.t. X
grad = tf.stop_gradient(grad)    # treat the gradient as a constant
e = constant * grad

Below is my PyTorch code:

import torch
import torch.nn.functional as F
import torch.optim as optim

class DocReaderModel(object):
    def __init__(self, embedding=None, state_dict=None):
        # AverageMeter, DNetwork, opt and parameters come from the rest of my code.
        self.train_loss = AverageMeter()
        self.embedding = embedding
        self.network = DNetwork(opt, embedding)
        self.optimizer = optim.SGD(parameters)

    def adversarial_loss(self, batch, loss, embedding, y):
        self.optimizer.zero_grad()
        loss.backward(retain_graph=True)
        grad = embedding.grad
        grad.detach_()

        perturb = F.normalize(grad, p=2) * 0.5
        self.optimizer.zero_grad()
        adv_embedding = embedding + perturb
        network_temp = DNetwork(self.opt, adv_embedding)  # a second net on the perturbed embedding
        network_temp.training = False
        network_temp.cuda()
        start, end, _ = network_temp(batch)  # This is how I get F(X + e)
        del network_temp  # I even deleted this instance.
        return F.cross_entropy(start, y[0]) + F.cross_entropy(end, y[1])

    def update(self, batch):
        self.network.train()
        start, end, pred = self.network(batch)
        loss = F.cross_entropy(start, y[0]) + F.cross_entropy(end, y[1])  # y holds the gold start/end labels (unpacking omitted)
        loss_adv = self.adversarial_loss(batch, loss, self.network.lexicon_encoder.embedding.weight, y)
        loss_total = loss + loss_adv

        self.optimizer.zero_grad()
        loss_total.backward()
        self.optimizer.step()

I have a few questions:

  1. I replaced tf.stop_gradient with grad.detach_(). Is this correct?

  2. I was getting "RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.", so I added retain_graph=True to the loss.backward() call. That specific error went away, but now I'm getting an out-of-memory error after a few epochs (RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1525909934016/work/aten/src/THC/generic/THCStorage.cu:58). I suspect I'm unnecessarily retaining the graph.

Can someone (@albanD?) let me know PyTorch's best practice on this? Any hint, or even a short comment, would be highly appreciated.

Hi,

  1. Yes, using .detach() is the right way to stop gradients from flowing back.
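For example (a tiny standalone check, not from your code):

import torch

x = torch.ones(3, requires_grad=True)
e = (2 * x).detach()    # e is a constant as far as autograd is concerned
out = (x + e).sum()
out.backward()
print(x.grad)           # tensor([1., 1., 1.]): nothing flows back through e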

To get the gradients, you don't have to do a full backward pass; you can use torch.autograd.grad to get the gradients for specific tensors, here the embedding for example. You will need to pass create_graph=True (which implies retain_graph=True).
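For instance (a self-contained sketch with a toy embedding; the names and sizes are invented, but the pattern maps directly onto your embedding.weight):

import torch
import torch.nn.functional as F

emb = torch.nn.Embedding(10, 4)   # stand-in for your lexicon_encoder.embedding
proj = torch.nn.Linear(4, 2)
ids = torch.tensor([1, 2, 3])
y = torch.tensor([0, 1, 0])

loss = F.cross_entropy(proj(emb(ids)), y)

# Gradient w.r.t. the embedding matrix only, without a full backward();
# create_graph=True implies retain_graph=True, so the graph stays alive
# for the later backward of the total loss.
grad, = torch.autograd.grad(loss, emb.weight, create_graph=True)
perturb = 0.5 * F.normalize(grad.detach(), p=2, dim=1)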

Also, I am not sure why you set training=False on the new net you create.

The last .backward() in the update function should not have retain_graph=True. If you get an error there, it means that some autograd operations are shared across two calls to update (and they shouldn't be).
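Putting the pieces together, your two methods could look roughly like this (a sketch reusing the names from your snippet; DNetwork, self.opt, y and the 0.5 constant are assumed to exist as in your code):

def adversarial_loss(self, batch, loss, embedding, y):
    # Gradient w.r.t. the embedding matrix only. Because the gradient is
    # detached right away, retain_graph=True is enough here
    # (create_graph=True would also work, and implies it).
    grad, = torch.autograd.grad(loss, embedding, retain_graph=True)
    perturb = F.normalize(grad.detach(), p=2) * 0.5
    network_temp = DNetwork(self.opt, embedding + perturb)
    network_temp.cuda()
    start, end, _ = network_temp(batch)   # F(X + e)
    return F.cross_entropy(start, y[0]) + F.cross_entropy(end, y[1])

def update(self, batch):
    self.network.train()
    start, end, pred = self.network(batch)
    loss = F.cross_entropy(start, y[0]) + F.cross_entropy(end, y[1])
    loss_adv = self.adversarial_loss(
        batch, loss, self.network.lexicon_encoder.embedding.weight, y)
    self.optimizer.zero_grad()
    (loss + loss_adv).backward()          # no retain_graph on the final backward
    self.optimizer.step()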
