How to get the gradient of the loss function twice

Here is what I'm trying to implement:

We calculate the loss based on F(X), as usual. We also define an "adversarial loss", which is a loss based on F(X + e), where e is dF(X)/dX multiplied by some constant. Both the regular loss and the adversarial loss are backpropagated for the total loss.
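To pin down the objective, here is a tiny pure-Python toy where the gradient can be written by hand (F(x) = x**2, the squared-error loss, and epsilon are all made up for illustration):

epsilon = 0.5
x, y = 3.0, 4.0
loss = (x**2 - y)**2                # loss on F(x), with F(x) = x**2
dloss_dx = 2 * (x**2 - y) * 2 * x   # derivative of the loss w.r.t. x, by hand
e = epsilon * dloss_dx              # perturbation, treated as a constant
loss_adv = ((x + e)**2 - y)**2      # loss on F(x + e)
loss_total = loss + loss_adv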

In TensorFlow, this part (getting dF(X)/dX) can be written like this:

grad, = tf.gradients(loss, X)    # gradient of the loss w.r.t. X
grad = tf.stop_gradient(grad)    # treat the gradient as a constant
e = constant * grad

Below is my PyTorch code:

import torch
import torch.nn.functional as F
import torch.optim as optim

class DocReaderModel(object):
    def __init__(self, embedding=None, state_dict=None):
        # AverageMeter, DNetwork, opt and parameters come from the rest of my code.
        self.train_loss = AverageMeter()
        self.embedding = embedding
        self.network = DNetwork(opt, embedding)
        self.optimizer = optim.SGD(parameters)

    def adversarial_loss(self, batch, loss, embedding, y):
        self.optimizer.zero_grad()
        loss.backward(retain_graph=True)
        grad = embedding.grad
        grad.detach_()

        perturb = F.normalize(grad, p=2) * 0.5
        self.optimizer.zero_grad()
        adv_embedding = embedding + perturb
        network_temp = DNetwork(self.opt, adv_embedding)  # a second net on the perturbed embedding
        network_temp.training = False
        network_temp.cuda()
        start, end, _ = network_temp(batch)  # This is how I get F(X + e)
        del network_temp  # I even deleted this instance.
        return F.cross_entropy(start, y[0]) + F.cross_entropy(end, y[1])

    def update(self, batch):
        self.network.train()
        start, end, pred = self.network(batch)
        loss = F.cross_entropy(start, y[0]) + F.cross_entropy(end, y[1])  # y holds the gold start/end labels (unpacking omitted)
        loss_adv = self.adversarial_loss(batch, loss, self.network.lexicon_encoder.embedding.weight, y)
        loss_total = loss + loss_adv

        self.optimizer.zero_grad()
        loss_total.backward()
        self.optimizer.step()

I have a few questions:

  1. I replaced tf.stop_gradient with grad.detach_(). Is this correct?

  2. I was getting "RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.", so I added retain_graph=True to the loss.backward() call. That specific error went away, but now I'm getting an out-of-memory error after a few epochs (RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1525909934016/work/aten/src/THC/generic/THCStorage.cu:58). I suspect I'm unnecessarily retaining the graph.

Can someone (@albanD?) let me know PyTorch's best practice on this? Any hint, or even a short comment, would be highly appreciated.

Hi,

  1. Yes, using .detach() is the right way to stop gradients from flowing back.
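For example (a tiny standalone check, not from your code):

import torch

x = torch.ones(3, requires_grad=True)
e = (2 * x).detach()    # e is a constant as far as autograd is concerned
out = (x + e).sum()
out.backward()
print(x.grad)           # tensor([1., 1., 1.]): nothing flows back through e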

To get the gradients, you don't have to do a full backward pass; you can use torch.autograd.grad to get the gradients for specific tensors, here the embedding for example. You will need to pass create_graph=True (which implies retain_graph=True).
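For instance (a self-contained sketch with a toy embedding; the names and sizes are invented, but the pattern maps directly onto your embedding.weight):

import torch
import torch.nn.functional as F

emb = torch.nn.Embedding(10, 4)   # stand-in for your lexicon_encoder.embedding
proj = torch.nn.Linear(4, 2)
ids = torch.tensor([1, 2, 3])
y = torch.tensor([0, 1, 0])

loss = F.cross_entropy(proj(emb(ids)), y)

# Gradient w.r.t. the embedding matrix only, without a full backward();
# create_graph=True implies retain_graph=True, so the graph stays alive
# for the later backward of the total loss.
grad, = torch.autograd.grad(loss, emb.weight, create_graph=True)
perturb = 0.5 * F.normalize(grad.detach(), p=2, dim=1)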

Also, I am not sure why you set training=False on the new net you create.

The last .backward() in the update function should not have retain_graph=True. If you get an error there, it means that some autograd operations are shared across two calls to update (and they shouldn't be).
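Putting the pieces together, your two methods could look roughly like this (a sketch reusing the names from your snippet; DNetwork, self.opt, y and the 0.5 constant are assumed to exist as in your code):

def adversarial_loss(self, batch, loss, embedding, y):
    # Gradient w.r.t. the embedding matrix only. Because the gradient is
    # detached right away, retain_graph=True is enough here
    # (create_graph=True would also work, and implies it).
    grad, = torch.autograd.grad(loss, embedding, retain_graph=True)
    perturb = F.normalize(grad.detach(), p=2) * 0.5
    network_temp = DNetwork(self.opt, embedding + perturb)
    network_temp.cuda()
    start, end, _ = network_temp(batch)   # F(X + e)
    return F.cross_entropy(start, y[0]) + F.cross_entropy(end, y[1])

def update(self, batch):
    self.network.train()
    start, end, pred = self.network(batch)
    loss = F.cross_entropy(start, y[0]) + F.cross_entropy(end, y[1])
    loss_adv = self.adversarial_loss(
        batch, loss, self.network.lexicon_encoder.embedding.weight, y)
    self.optimizer.zero_grad()
    (loss + loss_adv).backward()          # no retain_graph on the final backward
    self.optimizer.step()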
