Gradient Ascent and Gradient Modification/Modifying Optimizer instead of Grad_weight

Hi All,
I have a few questions about modifying gradients and the optimizer. Is there an easy way to perform gradient ascent instead of gradient descent? For example, this would correspond to replacing grad_weight with -grad_weight in the linear layer definition, as seen in class LinearFunction(Function): on the Extending PyTorch page. My concern is that this will break a downstream function that needs grad_weight rather than -grad_weight. Or is that not a concern at all? One suggestion I received was to modify the optimizer instead. Is there a simple way to compute W + dW instead of W - dW in the optimizer? I can’t really tell from the source code for SGD or Adam.

Thanks for reading!


The simplest way to do gradient ascent on a loss L is to do gradient descent on -L.
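As a minimal sketch of that idea (the objective, starting point, and learning rate are made up for illustration), negating the quantity before calling backward() lets an ordinary optimizer climb it:

```python
import torch

w = torch.zeros(1, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

def objective(w):
    # Hypothetical quantity to maximize; its maximum is at w = 3.
    return -((w - 3.0) ** 2).sum()

for _ in range(100):
    opt.zero_grad()
    loss = -objective(w)  # descending on -L is ascending on L
    loss.backward()
    opt.step()

print(w.item())  # should end up close to 3
```

Note that this flips the sign of every gradient that flows from the loss, so it does not single out one parameter.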


That is an interesting solution. I think I need to clarify my original question. I would like to put a negative sign on the weight updates only, which corresponds to changing grad_weight to -grad_weight while leaving grad_input and grad_bias untouched. However, I am wary of unintended consequences of changing the gradients this way. Is there an easy way to change the optimizer so that it performs gradient ascent (W + dW) on the weights of every layer except the last one, and leaves the other parameters alone?

In that case I guess you will have to write a custom optimizer to handle that, for example with one parameter group for the descent part and one for the ascent part.
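A rough sketch of what such an optimizer could look like is below. The class name SignedSGD and the per-group ascent flag are hypothetical, and the update is plain SGD without momentum or weight decay:

```python
import torch

class SignedSGD(torch.optim.Optimizer):
    """Hypothetical plain SGD whose parameter groups carry an `ascent` flag."""

    def __init__(self, params, lr=0.1, ascent=False):
        super().__init__(params, dict(lr=lr, ascent=ascent))

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            # +lr * grad for ascent groups, -lr * grad for descent groups
            sign = 1.0 if group["ascent"] else -1.0
            for p in group["params"]:
                if p.grad is not None:
                    p.add_(p.grad, alpha=sign * group["lr"])

# Toy parameters: one updated by descent, one by ascent.
w_down = torch.tensor([1.0], requires_grad=True)
w_up = torch.tensor([1.0], requires_grad=True)
opt = SignedSGD(
    [{"params": [w_down]}, {"params": [w_up], "ascent": True}],
    lr=0.1,
)

loss = (w_down ** 2).sum() + (w_up ** 2).sum()
loss.backward()  # both gradients equal 2.0
opt.step()
print(w_down.item(), w_up.item())  # roughly 0.8 and 1.2
```

Since only the update rule changes sign, the gradients themselves (including grad_input during backpropagation) are left as autograd computed them.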


Continuing the discussion from Gradient Ascent and Gradient Modification/Modifying Optimizer instead of Grad_weight:

I’m working on a similar problem where I need to optimize the following loss function:

Here w (omega) denotes the model parameters and the lambdas are Lagrange multipliers. I need to perform gradient descent with respect to omega and, simultaneously, gradient ascent with respect to lambda. Lambda is not a model parameter; it only appears in the loss term.
Will your solution of updating lambda using gradient descent on -L work in this case? If it does, then using a negative learning rate for the lambdas in gradient descent should be equivalent. If it doesn’t, what would be the PyTorch solution for this (without changing the optimizer source code)? Or do I need to create a custom optimizer?
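One possible way to set this up, sketched on a toy problem rather than the loss above: recent PyTorch releases accept a maximize=True argument on torch.optim.SGD, so omega and lambda can each get their own optimizer. The constraint, learning rates, and variable names here are assumptions for illustration. As far as I know, the built-in optimizers raise an error for a negative learning rate, so that route is not used.

```python
import torch

# Toy problem (an assumption): minimize w^2 subject to w >= 1,
# with Lagrangian L(w, lam) = w^2 + lam * (1 - w).
w = torch.zeros(1, requires_grad=True)
lam = torch.zeros(1, requires_grad=True)

opt_w = torch.optim.SGD([w], lr=0.1)                    # descent on w
opt_lam = torch.optim.SGD([lam], lr=0.1, maximize=True)  # ascent on lam

for _ in range(500):
    opt_w.zero_grad()
    opt_lam.zero_grad()
    lagrangian = (w ** 2 + lam * (1 - w)).sum()
    lagrangian.backward()
    opt_w.step()
    opt_lam.step()
    with torch.no_grad():
        lam.clamp_(min=0)  # multipliers for inequality constraints stay >= 0

print(w.item(), lam.item())  # should approach the saddle point w = 1, lam = 2
```

Both gradients come from the same backward pass, so the two steps amount to a simultaneous descent and ascent update.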


This may be a bit late, but the solution I came up with is a custom autograd function that reverses the gradient direction. Like @Tamal_Chowdhury, I have a Lagrangian optimization problem, and this function works perfectly for it. A small working example:

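import torch

class AscentFunction(torch.autograd.Function):
    """Identity in the forward pass; negates the gradient in the backward pass."""

    @staticmethod
    def forward(ctx, input):
        return input

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

def make_ascent(loss):
    return AscentFunction.apply(loss)

x = torch.normal(10, 3, size=(10,))
w = torch.ones_like(x, requires_grad=True)

# Ordinary backward pass: w.grad equals x
loss = (x * w).sum()
print(f'descent loss: {loss.item():.2f}')
loss.backward()
print(w.grad)

w.grad = None

# Same loss value, but the gradient is negated: w.grad equals -x
loss = (x * w).sum()
m_loss = make_ascent(loss)
print(f'ascent loss: {m_loss.item():.2f}')
m_loss.backward()
print(w.grad)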


Its output:

descent loss: 96.13
tensor([12.7093, 11.2243,  6.4265,  7.6572, 14.2737, 15.1144,  8.0099,  6.2517,
         7.6352,  6.8274])
ascent loss: 96.13
tensor([-12.7093, -11.2243,  -6.4265,  -7.6572, -14.2737, -15.1144,  -8.0099,
         -6.2517,  -7.6352,  -6.8274])
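For the Lagrangian case, the same trick can be applied to the multiplier alone instead of the whole loss, so that a single ordinary optimizer descends on the model parameter and ascends on the multiplier. Here is a sketch on a toy constrained problem; the name GradReverse, the constraint, and the hyperparameters are all assumptions for illustration:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity forward; flips the gradient sign on the way back."""

    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

# Toy problem (an assumption): minimize w^2 subject to w >= 1.
w = torch.zeros(1, requires_grad=True)
lam = torch.zeros(1, requires_grad=True)
opt = torch.optim.SGD([w, lam], lr=0.1)

for _ in range(500):
    opt.zero_grad()
    # lam passes through the reversal, so the descent step on lam
    # acts as an ascent step on the Lagrangian.
    lagrangian = (w ** 2 + GradReverse.apply(lam) * (1 - w)).sum()
    lagrangian.backward()
    opt.step()
    with torch.no_grad():
        lam.clamp_(min=0)  # keep the multiplier non-negative

print(w.item(), lam.item())  # should approach w = 1, lam = 2
```

Only the gradient flowing into lam is reversed; the gradient with respect to w is unchanged.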

I got the same problem.