TF2 GradientTape-style gradients in PyTorch

Hi.
I'm trying to convert the ValueDICE code (TF2) to PyTorch,

but I got stuck on the backward pass.

ValueDICE uses the same loss for two networks, pi and nu (one with a flipped sign).

So I tried to do the same, but I'm not sure how to do it properly in PyTorch.

I tried something like this:

# assume: a = pi, b = nu
self.a_optimizer = torch.optim.Adam(self.a.parameters())
self.b_optimizer = torch.optim.Adam(self.b.parameters())

# ~~~

loss = ...  # the ValueDICE loss (placeholder)
a_loss = -loss  # + pi regularization
b_loss = loss   # + nu penalty

self.a_optimizer.zero_grad()
a_loss.backward(retain_graph=True)
self.b_optimizer.zero_grad()
b_loss.backward()

self.a_optimizer.step()
self.b_optimizer.step()
    

In this case, a's parameters do not change (it looks like the two backward() calls accumulate opposite gradients into a's .grad, which cancel out).

I've tried other approaches as well, but I'm not sure how to get this right.

Could you give me a keyword for how PyTorch handles this kind of case properly?

Hi,

You don't actually need two backward calls if you just want to accumulate the gradients; you can do:

tm.a_optimizer.zero_grad()  
tm.b_optimizer.zero_grad()
(a_loss + b_loss).backward()

And you will get gradients for both a's and b's parameters, as long as they were used to compute the loss.
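
For example, a minimal sketch (with hypothetical parameters p_a and p_b standing in for your networks' weights):

import torch

# stand-ins for the parameters of a and b
p_a = torch.randn(3, requires_grad=True)
p_b = torch.randn(3, requires_grad=True)

a_loss = (p_a ** 2).sum()
b_loss = (p_b ** 2).sum()

# one backward populates .grad for every parameter used in either loss
(a_loss + b_loss).backward()
print(p_a.grad)  # 2 * p_a
print(p_b.grad)  # 2 * p_b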

Thank you for your comment.

Hmm… in that case, the loss may be 0, and backpropagation doesn't seem to happen.

The problem is that if you simply add them, the main loss term disappears (one is -loss, the other is loss).

If you look at the original TF2 code, it can easily compute a different gradient for each network, like:

a_gradient = tape.gradient(a_loss, a.parameters())
b_gradient = tape.gradient(b_loss, b.parameters())

a_optimizer.apply_gradients(zip(a_gradient, a.parameters()))
b_optimizer.apply_gradients(zip(b_gradient, b.parameters()))

From my testing, it seems this kind of partial differentiation isn't possible with a plain backward().

Isn't there a similarly convenient method in PyTorch?

Hmm… in that case, the loss may be 0, and backpropagation doesn't seem to happen.

The loss being 0 does not imply that the gradients are 0 :wink:
In particular, if you have final_loss = a_loss - b_loss, the gradient backpropagated into each partial loss will be 1 and -1 respectively.
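
A minimal illustration (hypothetical scalars, not from your code): the combined loss evaluates to 0, yet each leaf still receives a non-zero gradient.

import torch

a = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)

final_loss = a ** 2 - b ** 2  # 9 - 9 = 0
final_loss.backward()

print(final_loss.item())  # 0.0
print(a.grad)             # tensor(6.)  = d(a^2)/da
print(b.grad)             # tensor(-6.) = d(-b^2)/db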

import time

import torch
import torch.nn as nn

import numpy as np


class A(nn.Module):
    def __init__(self):
        super(A, self).__init__()
        self.last = nn.Sequential(
            nn.Linear(11, 2),
            nn.LeakyReLU(),
            nn.Linear(2, 2),
            nn.LeakyReLU(),
            nn.Linear(2, 1)
        )

    def forward(self, x):
        return self.last(x)


class B(nn.Module):
    def __init__(self):
        super(B, self).__init__()
        self.last = nn.Sequential(
            nn.Linear(11, 2),
            nn.LeakyReLU(),
            nn.Linear(2, 2),
            nn.LeakyReLU(),
            nn.Linear(2, 1),    
        )

    def forward(self, x):
        x = self.last(x)
        return x


class TestModule():
    def __init__(self):
        self.a = A()
        self.b = B()

        #self.optimizer = torch.optim.Adam([{"params":self.a.parameters()}, {"params":self.b.parameters()}])
        self.a_optimizer = torch.optim.Adam(self.a.parameters())
        self.b_optimizer = torch.optim.Adam(self.b.parameters())


tm = TestModule()

for _ in range(10000):
    input = np.array([1 for _ in range(11)])

    input_tensor = torch.from_numpy(input).float()

    result_a = tm.a(input_tensor)
    result_b = tm.b(input_tensor)

    loss_a = (result_a-1)**2
    loss_b = (result_b-1)**2

    loss = (loss_a - loss_b)**2

    a_loss = -loss
    b_loss = loss

    tm.a_optimizer.zero_grad()  
    tm.b_optimizer.zero_grad()
    
    # for e in tm.a.parameters():
    #     e.requires_grad = True
    # for e in tm.b.parameters():
    #     e.requires_grad = False

    (a_loss+b_loss).backward()  # note: a_loss + b_loss == 0 identically, so every gradient comes out zero
    #a_loss.backward()
    print("a_backward")
    print(f"a_loss : {a_loss}")
    print(tm.a.last[0].weight.grad)
    print(tm.b.last[0].weight.grad)
    
    # for e in tm.a.parameters():
    #     e.requires_grad = False
    # for e in tm.b.parameters():
    #     e.requires_grad = True
    
    #b_loss.backward()
    print("b_backward")
    print(f"b_loss : {b_loss}")
    print(tm.a.last[0].weight.grad)
    print(tm.b.last[0].weight.grad)

    tm.a_optimizer.step()  
    tm.b_optimizer.step()
    
    print(f"a : {result_a}")
    print(f"b : {result_b}")

    time.sleep(5)

(Code for testing purposes)

I tried various things, but backprop doesn’t seem to work.

import time

import torch
import torch.nn as nn

import numpy as np

class A(nn.Module):
    def __init__(self):
        super(A, self).__init__()
        self.last = nn.Sequential(
            nn.Linear(11, 2),
            nn.LeakyReLU(),
            nn.Linear(2, 2),
            nn.LeakyReLU(),
            nn.Linear(2, 1)
        )

    def forward(self, x):
        return self.last(x)

class B(nn.Module):
    def __init__(self):
        super(B, self).__init__()
        self.last = nn.Sequential(
            nn.Linear(11, 2),
            nn.LeakyReLU(),
            nn.Linear(2, 2),
            nn.LeakyReLU(),
            nn.Linear(2, 1),    
        )

    def forward(self, x):
        x = self.last(x)
        return x


from torch.autograd import Function

class RevGradF(Function):
    @staticmethod
    def forward(ctx, input_, alpha_):
        ctx.save_for_backward(input_, alpha_)
        output = input_
        return output

    @staticmethod
    def backward(ctx, grad_output):  # pragma: no cover
        grad_input = None
        _, alpha_ = ctx.saved_tensors
        if ctx.needs_input_grad[0]:
            grad_input = -grad_output * alpha_
        return grad_input, None

revgrad = RevGradF.apply

class RevGrad(nn.Module):
    def __init__(self, alpha=1., *args, **kwargs):
        """
        A gradient reversal layer.
        This layer has no parameters, and simply reverses the gradient
        in the backward pass.
        """
        super().__init__(*args, **kwargs)

        self._alpha = torch.tensor(alpha, requires_grad=False)

    def forward(self, input_):
        return revgrad(input_, self._alpha)
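
# Quick sanity check of the reversal layer (not part of the original script):
#   x = torch.randn(3, requires_grad=True)
#   RevGrad()(x).sum().backward()
#   print(x.grad)  # all -1 instead of the +1 a plain sum would give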


class TestModule():
    def __init__(self):
        self.a = A()
        self.b = B()

        self.revgrad = RevGrad()
        self.b = nn.Sequential(self.b, self.revgrad)

        #self.optimizer = torch.optim.Adam([{"params":self.a.parameters()}, {"params":self.b.parameters()}])
        self.a_optimizer = torch.optim.Adam(self.a.parameters())
        self.b_optimizer = torch.optim.Adam(self.b.parameters())


tm = TestModule()

for _ in range(10000):
    input = np.array([1 for _ in range(11)])

    input_tensor = torch.from_numpy(input).float()

    result_a = tm.a(input_tensor)
    result_b = tm.b(input_tensor)

    loss_a = (result_a-1)**2
    loss_b = (result_b-1)**2

    loss = (loss_a - loss_b)**2

    tm.a_optimizer.zero_grad()  
    tm.b_optimizer.zero_grad()

    # for e in tm.a.parameters():
    #     e.requires_grad = True
    # for e in tm.b.parameters():
    #     e.requires_grad = False

    (loss).backward()  # RevGrad flips the sign of the gradient flowing into b during this backward
    #a_loss.backward()
    # print("a_backward")
    # print(f"loss : {loss}")
    # print(tm.a.last[0].weight.grad)
    # print(tm.b[0].last[0].weight.grad)
    
    # for e in tm.a.parameters():
    #     e.requires_grad = False
    # for e in tm.b.parameters():
    #     e.requires_grad = True
    
    #b_loss.backward()
    # print("b_backward")
    # print(tm.a.last[0].weight.grad)
    # print(tm.b[0].last[0].weight.grad)

    tm.a_optimizer.step()  
    tm.b_optimizer.step()
    
    print(f"a : {result_a}")
    print(f"b : {result_b}")

    #time.sleep(5)

I heard elsewhere that I should use a custom autograd.Function. Is this appropriate?

It seems to work.

I'm wondering whether this is the proper way to do it in PyTorch.

Not sure why you need the custom Function? Is the goal just to multiply the gradient by -alpha?

In your code above, you do

loss = XXX
(loss - loss).backward()

If you do that, you won't get any gradients, because loss - loss is identically zero (not just zero-valued).
What I was saying is to do (loss_a - loss_b).backward()

Thank you for the answer, but I think we're talking about different things.

# Part of the ValueDICE code
loss = (non_linear_loss - linear_loss)

# maybe loss.backward()? I think this part is not the problem.

nu_loss = loss + nu_grad_penalty * nu_reg
pi_loss = -loss + keras_utils.orthogonal_regularization(self.actor.trunk)

nu_grads = tape.gradient(nu_loss, self.nu_net.variables)
pi_grads = tape.gradient(pi_loss, self.actor.variables)
# or (nu_loss + pi_loss).backward()?

self.nu_optimizer.apply_gradients(zip(nu_grads, self.nu_net.variables))
self.actor_optimizer.apply_gradients(zip(pi_grads, self.actor.variables))

Looking at this example, it doesn't seem possible in PyTorch to simply add the losses and call backward().

The main point is not a (nu) and b (pi) themselves: for the same loss (non_linear_loss - linear_loss), we have to compute gradients in opposite directions for different sets of parameters.

Am I misunderstanding?
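
To illustrate the cancellation I mean, here is a minimal sketch (with a hypothetical parameter w standing in for anything the shared loss depends on):

import torch

w = torch.tensor(2.0, requires_grad=True)
loss = w ** 2

nu_loss = loss   # + penalty terms omitted
pi_loss = -loss  # + regularization omitted

(nu_loss + pi_loss).backward()  # the shared term cancels exactly
print(w.grad)                   # tensor(0.)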

Ah, I think I misunderstood: the two losses actually share the computation graph, but you want each loss to contribute only to the gradients of a subset of the weights.

In that case, you do indeed want two backward passes.
You can either get the gradients directly with autograd.grad, or, if you're using the nightly PyTorch build, you can tell .backward() which inputs the gradients should be computed for.

In particular, if you want to reproduce

    nu_grads = tape.gradient(nu_loss, self.nu_net.variables) 
    pi_grads = tape.gradient(pi_loss, self.actor.variables)

I think you just want

    # retain_graph=True on the first call because you backward through the same graph twice
    # (.variables mirrors the TF code; in PyTorch you would pass the .parameters())
    nu_grads = autograd.grad(nu_loss, self.nu_net.variables, retain_graph=True)
    pi_grads = autograd.grad(pi_loss, self.actor.variables)

Oh… that's what I was looking for!

Since receiving your answer, I've googled quite a bit, but there isn't much material on this. Hmm…

Is there a function for applying gradients, like TF's apply_gradients()?

    a_grads = autograd.grad(a_loss, list(tm.a.parameters()), retain_graph=True)
    for p, g in zip(tm.a.parameters(), a_grads):
        p.grad = g  # autograd.grad already returns tensors, no need to re-wrap them

    b_grads = autograd.grad(b_loss, list(tm.b.parameters()))
    for p, g in zip(tm.b.parameters(), b_grads):
        p.grad = g

    tm.a_optimizer.step()  
    tm.b_optimizer.step()

Anyway, I confirmed that this seems to be learning. Thank you for your answers so far.
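
In case it helps others, here is a small helper that mimics TF's apply_gradients (the name is mine, not a PyTorch API):

import torch
from torch import autograd

def apply_gradients(grads, params, optimizer):
    # PyTorch analogue of optimizer.apply_gradients(zip(grads, vars)):
    # write the gradients into .grad, then let the optimizer consume them
    for p, g in zip(params, grads):
        p.grad = g
    optimizer.step()

# usage with the code above:
# a_grads = autograd.grad(a_loss, list(tm.a.parameters()), retain_graph=True)
# apply_gradients(a_grads, tm.a.parameters(), tm.a_optimizer)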

The way to do this is the following (it only works on the nightly PyTorch build right now, and will be in 1.8 when it comes out):

a_loss.backward(inputs=tm.a.parameters(), retain_graph=True)
b_loss.backward(inputs=tm.b.parameters())

tm.a_optimizer.step()  
tm.b_optimizer.step()

Hope this helps!
Sorry for the confusion earlier!

Wow. It really helped a lot. Thank you.