I wonder about the best way to implement gradient reversal, or gradient scaling in general (reversal is just the special case of using factor -1).
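Concretely, what I have in mind is something along these lines (a minimal sketch with names of my own, not taken from any particular library):

```python
import torch

class GradScale(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by `scale` in the backward pass."""

    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # one gradient per forward input; the `scale` argument gets None
        return grad_output * ctx.scale, None


x = torch.randn(3, requires_grad=True)
GradScale.apply(x, -1.0).sum().backward()
print(x.grad)  # tensor([-1., -1., -1.]) -> gradient reversal
```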
Related:
Existing implementations:
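To make the questions below self-contained, the two styles of code I'm comparing look roughly like this (paraphrased from the linked implementations, so details may differ from the actual repos):

```python
import torch

# Fairseq-style: the scale is stored as a plain attribute on ctx
class GradMultiply(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        res = x.new(x)
        return res

    @staticmethod
    def backward(ctx, grad):
        return grad * ctx.scale, None


# Style of the other implementations: tensors stored via save_for_backward
# (so alpha_ is expected to be a tensor), plus a needs_input_grad check
class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input_, alpha_):
        ctx.save_for_backward(input_, alpha_)
        return input_

    @staticmethod
    def backward(ctx, grad_output):
        grad_input = None
        _, alpha_ = ctx.saved_tensors
        if ctx.needs_input_grad[0]:
            grad_input = -grad_output * alpha_
        return grad_input, None
```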
Some questions on this code:
1. Fairseq just does `ctx.scale = scale`, while the other implementations use `ctx.save_for_backward(input_, alpha_)`. What's the difference, and which is better?
2. Fairseq uses `res = x.new(x)` but the others do not. Why is this needed, and what does it actually do? I could not find documentation on `Tensor.new`.
3. The other implementations check `ctx.needs_input_grad[0]` in the backward pass, but Fairseq does not. Is this check not needed?