TF2 GradientTape-style gradients in PyTorch

Hi.
I'm trying to convert the ValueDICE code (TF2) to PyTorch,

but I got stuck on the backward pass.

ValueDICE uses the same loss for two networks, pi and nu (one with a flipped sign).

So I tried to do the same, but I'm not sure how to do it properly in PyTorch.

I tried something like this:

# assume: a = pi, b = nu
self.a_optimizer = torch.optim.Adam(self.a.parameters())
self.b_optimizer = torch.optim.Adam(self.b.parameters())

# ~~~

loss = ...  # the ValueDICE loss (placeholder)
a_loss = -loss  # + pi regularization
b_loss = loss   # + nu penalty

self.a_optimizer.zero_grad()
a_loss.backward(retain_graph=True)
self.b_optimizer.zero_grad()
b_loss.backward()

self.a_optimizer.step()
self.b_optimizer.step()
    

In this case, a's parameters do not change (it looks like the two backward() calls accumulate opposite gradients into a's .grad, which cancel out).

I've tried other approaches as well, but I'm not sure how to get this right.

Could you give me a keyword for how PyTorch handles this kind of case properly?

Hi,

You don't actually need two backward calls if you just want to accumulate the gradients; you can do:

tm.a_optimizer.zero_grad()  
tm.b_optimizer.zero_grad()
(a_loss + b_loss).backward()

And you will get gradients for both a's and b's parameters, as long as they were used to compute the loss.
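
For example, a minimal sketch (with hypothetical parameters p_a and p_b standing in for your networks' weights):

import torch

# stand-ins for the parameters of a and b
p_a = torch.randn(3, requires_grad=True)
p_b = torch.randn(3, requires_grad=True)

a_loss = (p_a ** 2).sum()
b_loss = (p_b ** 2).sum()

# one backward populates .grad for every parameter used in either loss
(a_loss + b_loss).backward()
print(p_a.grad)  # 2 * p_a
print(p_b.grad)  # 2 * p_b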

Thank you for your comment.

Hmm… in that case, the loss may be 0, and backpropagation doesn't seem to happen.

The problem is that if you simply add them, the main loss term disappears (one is -loss, the other is loss).

If you look at the original TF2 code, it can easily compute a different gradient for each network, like:

a_gradient = tape.gradient(a_loss, a.parameters())
b_gradient = tape.gradient(b_loss, b.parameters())

a_optimizer.apply_gradients(zip(a_gradient, a.parameters()))
b_optimizer.apply_gradients(zip(b_gradient, b.parameters()))

From my testing, it seems this kind of partial differentiation isn't possible with a plain backward().

Isn't there a similarly convenient method in PyTorch?

Hmm… in that case, the loss may be 0, and backpropagation doesn't seem to happen.

The loss being 0 does not imply that the gradients are 0 :wink:
In particular, if you have final_loss = a_loss - b_loss, the gradient backpropagated into each partial loss will be 1 and -1 respectively.
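
A minimal illustration (hypothetical scalars, not from your code): the combined loss evaluates to 0, yet each leaf still receives a non-zero gradient.

import torch

a = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)

final_loss = a ** 2 - b ** 2  # 9 - 9 = 0
final_loss.backward()

print(final_loss.item())  # 0.0
print(a.grad)             # tensor(6.)  = d(a^2)/da
print(b.grad)             # tensor(-6.) = d(-b^2)/db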

import time

import torch
import torch.nn as nn

import numpy as np


class A(nn.Module):
    def __init__(self):
        super(A, self).__init__()
        self.last = nn.Sequential(
            nn.Linear(11, 2),
            nn.LeakyReLU(),
            nn.Linear(2, 2),
            nn.LeakyReLU(),
            nn.Linear(2, 1)
        )

    def forward(self, x):
        return self.last(x)


class B(nn.Module):
    def __init__(self):
        super(B, self).__init__()
        self.last = nn.Sequential(
            nn.Linear(11, 2),
            nn.LeakyReLU(),
            nn.Linear(2, 2),
            nn.LeakyReLU(),
            nn.Linear(2, 1),    
        )

    def forward(self, x):
        x = self.last(x)
        return x


class TestModule():
    def __init__(self):
        self.a = A()
        self.b = B()

        #self.optimizer = torch.optim.Adam([{"params":self.a.parameters()}, {"params":self.b.parameters()}])
        self.a_optimizer = torch.optim.Adam(self.a.parameters())
        self.b_optimizer = torch.optim.Adam(self.b.parameters())


tm = TestModule()

for _ in range(10000):
    input = np.array([1 for _ in range(11)])

    input_tensor = torch.from_numpy(input).float()

    result_a = tm.a(input_tensor)
    result_b = tm.b(input_tensor)

    loss_a = (result_a-1)**2
    loss_b = (result_b-1)**2

    loss = (loss_a - loss_b)**2

    a_loss = -loss
    b_loss = loss

    tm.a_optimizer.zero_grad()  
    tm.b_optimizer.zero_grad()
    
    # for e in tm.a.parameters():
    #     e.requires_grad = True
    # for e in tm.b.parameters():
    #     e.requires_grad = False

    (a_loss+b_loss).backward()  # note: a_loss + b_loss == 0 identically, so every gradient comes out zero
    #a_loss.backward()
    print("a_backward")
    print(f"a_loss : {a_loss}")
    print(tm.a.last[0].weight.grad)
    print(tm.b.last[0].weight.grad)
    
    # for e in tm.a.parameters():
    #     e.requires_grad = False
    # for e in tm.b.parameters():
    #     e.requires_grad = True
    
    #b_loss.backward()
    print("b_backward")
    print(f"b_loss : {b_loss}")
    print(tm.a.last[0].weight.grad)
    print(tm.b.last[0].weight.grad)

    tm.a_optimizer.step()  
    tm.b_optimizer.step()
    
    print(f"a : {result_a}")
    print(f"b : {result_b}")

    time.sleep(5)

(Code for testing purposes)

I tried various things, but backprop doesn’t seem to work.

import time

import torch
import torch.nn as nn

import numpy as np

class A(nn.Module):
    def __init__(self):
        super(A, self).__init__()
        self.last = nn.Sequential(
            nn.Linear(11, 2),
            nn.LeakyReLU(),
            nn.Linear(2, 2),
            nn.LeakyReLU(),
            nn.Linear(2, 1)
        )

    def forward(self, x):
        return self.last(x)

class B(nn.Module):
    def __init__(self):
        super(B, self).__init__()
        self.last = nn.Sequential(
            nn.Linear(11, 2),
            nn.LeakyReLU(),
            nn.Linear(2, 2),
            nn.LeakyReLU(),
            nn.Linear(2, 1),    
        )

    def forward(self, x):
        x = self.last(x)
        return x


from torch.autograd import Function

class RevGradF(Function):
    @staticmethod
    def forward(ctx, input_, alpha_):
        ctx.save_for_backward(input_, alpha_)
        output = input_
        return output

    @staticmethod
    def backward(ctx, grad_output):  # pragma: no cover
        grad_input = None
        _, alpha_ = ctx.saved_tensors
        if ctx.needs_input_grad[0]:
            grad_input = -grad_output * alpha_
        return grad_input, None

revgrad = RevGradF.apply

class RevGrad(nn.Module):
    def __init__(self, alpha=1., *args, **kwargs):
        """
        A gradient reversal layer.
        This layer has no parameters, and simply reverses the gradient
        in the backward pass.
        """
        super().__init__(*args, **kwargs)

        self._alpha = torch.tensor(alpha, requires_grad=False)

    def forward(self, input_):
        return revgrad(input_, self._alpha)
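
# Quick sanity check of the reversal layer (not part of the original script):
#   x = torch.randn(3, requires_grad=True)
#   RevGrad()(x).sum().backward()
#   print(x.grad)  # all -1 instead of the +1 a plain sum would give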


class TestModule():
    def __init__(self):
        self.a = A()
        self.b = B()

        self.revgrad = RevGrad()
        self.b = nn.Sequential(self.b, self.revgrad)

        #self.optimizer = torch.optim.Adam([{"params":self.a.parameters()}, {"params":self.b.parameters()}])
        self.a_optimizer = torch.optim.Adam(self.a.parameters())
        self.b_optimizer = torch.optim.Adam(self.b.parameters())


tm = TestModule()

for _ in range(10000):
    input = np.array([1 for _ in range(11)])

    input_tensor = torch.from_numpy(input).float()

    result_a = tm.a(input_tensor)
    result_b = tm.b(input_tensor)

    loss_a = (result_a-1)**2
    loss_b = (result_b-1)**2

    loss = (loss_a - loss_b)**2

    tm.a_optimizer.zero_grad()  
    tm.b_optimizer.zero_grad()

    # for e in tm.a.parameters():
    #     e.requires_grad = True
    # for e in tm.b.parameters():
    #     e.requires_grad = False

    (loss).backward()  # RevGrad flips the sign of the gradient flowing into b during this backward
    #a_loss.backward()
    # print("a_backward")
    # print(f"loss : {loss}")
    # print(tm.a.last[0].weight.grad)
    # print(tm.b[0].last[0].weight.grad)
    
    # for e in tm.a.parameters():
    #     e.requires_grad = False
    # for e in tm.b.parameters():
    #     e.requires_grad = True
    
    #b_loss.backward()
    # print("b_backward")
    # print(tm.a.last[0].weight.grad)
    # print(tm.b[0].last[0].weight.grad)

    tm.a_optimizer.step()  
    tm.b_optimizer.step()
    
    print(f"a : {result_a}")
    print(f"b : {result_b}")

    #time.sleep(5)

I heard elsewhere that I should use a custom autograd.Function. Is this appropriate?

It seems to work.

I'm wondering whether this is the proper way to do it in PyTorch.

Not sure why you need the custom Function? Is the goal just to multiply the gradient by -alpha?

In your code above, you do

loss = XXX
(loss - loss).backward()

If you do that, you won't get any gradients, because loss - loss is identically zero (not just zero-valued).
What I was saying is to do (loss_a - loss_b).backward()

Thank you for the answer, but I think we're talking about different things.

# Part of the ValueDICE code
loss = (non_linear_loss - linear_loss)

# maybe loss.backward()? I think this part is not the problem.

nu_loss = loss + nu_grad_penalty * nu_reg
pi_loss = -loss + keras_utils.orthogonal_regularization(self.actor.trunk)

nu_grads = tape.gradient(nu_loss, self.nu_net.variables)
pi_grads = tape.gradient(pi_loss, self.actor.variables)
# or (nu_loss + pi_loss).backward()?

self.nu_optimizer.apply_gradients(zip(nu_grads, self.nu_net.variables))
self.actor_optimizer.apply_gradients(zip(pi_grads, self.actor.variables))

Looking at this example, it doesn't seem possible in PyTorch to simply add the losses and call backward().

The main point is not a (nu) and b (pi) themselves: for the same loss (non_linear_loss - linear_loss), we have to compute gradients in opposite directions for different sets of parameters.

Am I misunderstanding?
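
To illustrate the cancellation I mean, here is a minimal sketch (with a hypothetical parameter w standing in for anything the shared loss depends on):

import torch

w = torch.tensor(2.0, requires_grad=True)
loss = w ** 2

nu_loss = loss   # + penalty terms omitted
pi_loss = -loss  # + regularization omitted

(nu_loss + pi_loss).backward()  # the shared term cancels exactly
print(w.grad)                   # tensor(0.)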

Ah, I think I misunderstood: the two losses actually share the computation graph, but you want each loss to contribute only to the gradients of a subset of the weights.

In that case, you do indeed want two backward passes.
You can either get the gradients directly with autograd.grad, or, if you're using the nightly PyTorch build, you can tell .backward() which inputs the gradients should be computed for.

In particular, if you want to reproduce

    nu_grads = tape.gradient(nu_loss, self.nu_net.variables) 
    pi_grads = tape.gradient(pi_loss, self.actor.variables)

I think you just want

    # retain_graph=True on the first call because you backward through the same graph twice
    # (.variables mirrors the TF code; in PyTorch you would pass the .parameters())
    nu_grads = autograd.grad(nu_loss, self.nu_net.variables, retain_graph=True)
    pi_grads = autograd.grad(pi_loss, self.actor.variables)

Oh… that's what I was looking for!

Since receiving your answer, I've googled quite a bit, but there isn't much material on this. Hmm…

Is there a function for applying gradients, like TF's apply_gradients()?

    a_grads = autograd.grad(a_loss, list(tm.a.parameters()), retain_graph=True)
    for p, g in zip(tm.a.parameters(), a_grads):
        p.grad = g  # autograd.grad already returns tensors, no need to re-wrap them

    b_grads = autograd.grad(b_loss, list(tm.b.parameters()))
    for p, g in zip(tm.b.parameters(), b_grads):
        p.grad = g

    tm.a_optimizer.step()  
    tm.b_optimizer.step()

Anyway, I confirmed that this seems to be learning. Thank you for your answers so far.
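
In case it helps others, here is a small helper that mimics TF's apply_gradients (the name is mine, not a PyTorch API):

import torch
from torch import autograd

def apply_gradients(grads, params, optimizer):
    # PyTorch analogue of optimizer.apply_gradients(zip(grads, vars)):
    # write the gradients into .grad, then let the optimizer consume them
    for p, g in zip(params, grads):
        p.grad = g
    optimizer.step()

# usage with the code above:
# a_grads = autograd.grad(a_loss, list(tm.a.parameters()), retain_graph=True)
# apply_gradients(a_grads, tm.a.parameters(), tm.a_optimizer)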

The way to do this is the following (it only works on the nightly PyTorch build right now, and will be in 1.8 when it comes out):

a_loss.backward(inputs=tm.a.parameters(), retain_graph=True)
b_loss.backward(inputs=tm.b.parameters())

tm.a_optimizer.step()  
tm.b_optimizer.step()

Hope this helps!
Sorry for the confusion earlier!

Wow. It really helped a lot. Thank you.