Restrict range of variable during gradient descent

Is it possible to restrict the range of values that a Variable can take? I have a variable that I want to keep in the range [0, 1], but the optimizer will send it out of this range. I am using torch.clamp() to ultimately clamp the result to [0, 1], but I want my optimizer to not update the value to be < 0 or > 1. For example, if my variable currently sits at 0.1, and the gradients come in and my optimizer wants to decrease it by 0.5, which would make the new value -0.4, I want the optimizer to clamp its update to 0.1, so the variable only gets updated up to my bounds.

I know I can register a hook for the variable, which I tried, but that way I can only control the size of the gradient, not the actual update size. I’m sure if I just wrote a custom optimizer I could make it work but there’s no way I can beat the Adam optimizer.


I would copy the code for the Adam optimizer and modify it to do what you want.


For your example (constraining variables to be between 0 and 1), there’s no difference between what you’re suggesting – clipping the gradient update – versus letting that gradient update take place in full and then clipping the weights afterwards. Clipping the weights, however, is much easier than modifying the optimizer.
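To make the equivalence concrete, here is a minimal sketch in plain Python using the numbers from the question. The helper names `truncated_update` and `step_then_clamp` are made up for illustration and are not part of any optimizer API.

```python
def truncated_update(value, update, low=0.0, high=1.0):
    """Shrink the update so that value + update stays inside [low, high]."""
    clipped = min(max(update, low - value), high - value)
    return value + clipped


def step_then_clamp(value, update, low=0.0, high=1.0):
    """Apply the full update, then clamp the result to [low, high]."""
    return min(max(value + update, low), high)


# Numbers from the question: the value is 0.1 and the optimizer proposes -0.5.
a = truncated_update(0.1, -0.5)
b = step_then_clamp(0.1, -0.5)
print(a, b)
```

Both routes leave the variable sitting on the lower bound, which is why clamping after the step is usually the simpler option.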

Here’s a simple example of a UnitNorm clipper:

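import torch


class UnitNormClipper(object):

    def __init__(self, frequency=5):
        # apply the clipper every `frequency` epochs
        self.frequency = frequency

    def __call__(self, module):
        # filter the variables to get the ones you want
        if hasattr(module, 'weight'):
            w = module.weight.data
            # rescale each row to unit L2 norm; keepdim=True lets the norms broadcast
            w.div_(torch.norm(w, p=2, dim=1, keepdim=True))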

Instantiate this with clipper = UnitNormClipper(); then, after the optimizer.step() call, do the following:

model.apply(clipper)

Full training loop example:

for epoch in range(nb_epoch):
    for batch_idx in range(nb_batches):
        xbatch = x[batch_idx*batch_size:(batch_idx+1)*batch_size]
        ybatch = y[batch_idx*batch_size:(batch_idx+1)*batch_size]

        optimizer.zero_grad()
        xp, yp = model(xbatch, ybatch)
        loss = model.loss(xp, yp)
        loss.backward()
        optimizer.step()

    # re-apply the constraint every `clipper.frequency` epochs
    if epoch % clipper.frequency == 0:
        model.apply(clipper)

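A 0-1 clipper might look like this (not tested). Note that it rescales the weights so that they span [0, 1] (a min-max normalization) rather than clamping each value:

class ZeroOneClipper(object):

    def __init__(self, frequency=5):
        self.frequency = frequency

    def __call__(self, module):
        # filter the variables to get the ones you want
        if hasattr(module, 'weight'):
            w = module.weight.data
            # compute both extremes before modifying w in place
            w_min, w_max = torch.min(w), torch.max(w)
            w.sub_(w_min).div_(w_max - w_min)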
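
The clipper above rescales the weights with a min-max normalization. If the goal is instead to clamp each value into [0, 1], as in the original question, a variant along the following lines may be closer. This is only a sketch: `ClampClipper` and the small `nn.Sequential` model are hypothetical names chosen for illustration, and it uses `torch.no_grad()` rather than `.data`.

```python
import torch
import torch.nn as nn


class ClampClipper:
    """Hypothetical helper: clamp every `weight` tensor into [low, high]."""

    def __init__(self, low=0.0, high=1.0):
        self.low = low
        self.high = high

    def __call__(self, module):
        # only touch modules that actually own a weight tensor
        if hasattr(module, "weight") and isinstance(module.weight, torch.Tensor):
            # no_grad keeps the in-place edit out of the autograd graph
            with torch.no_grad():
                module.weight.clamp_(self.low, self.high)


# Toy model, assumed for the example only.
model = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 1))
model.apply(ClampClipper())
```

As with the other clippers, `model.apply(...)` would be called after `optimizer.step()`.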

Thanks for your reply. If I try to clip the variable after each optimizer step I get the following error:

RuntimeError: Trying to backward through the graph second time, but the buffers have already been freed. Please specify retain_variables=True when calling backward for the first time.

It seems like if you manually mess with the variable state then the variable gets marked dirty or something.

EDIT: Oh, I guess if you only manipulate the .data attribute you don’t get that error. It’s working now, thanks!!
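
For reference, a minimal sketch of the pattern that works: step with Adam, then clamp the value in place without autograd recording the edit. It assumes a recent PyTorch release, where `torch.no_grad()` plays the role that editing `.data` played here (and where the flag in the error message is called `retain_graph`). The toy loss and the learning rate are assumptions for the example.

```python
import torch

torch.manual_seed(0)

# A single parameter that should stay inside [0, 1]; it starts at 0.1.
p = torch.tensor([0.1], requires_grad=True)
optimizer = torch.optim.Adam([p], lr=0.05)

history = []
for step in range(50):
    optimizer.zero_grad()
    # Toy loss whose unconstrained minimum is at -0.4, outside the bounds.
    loss = ((p + 0.4) ** 2).sum()
    loss.backward()
    optimizer.step()
    # Edit the value in place without recording it in the autograd graph.
    with torch.no_grad():
        p.clamp_(0.0, 1.0)
    history.append(p.item())
```

The parameter is pushed toward -0.4 on every step but is held at the lower bound by the clamp, and `backward()` keeps working on the following iterations.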


I am trying to optimize a function with gradient descent, with the constraint that the values should stay between zero and one. Is clamping the updated values the only way to deal with this kind of problem? Is it a common approach to this constraint in the machine learning community? Another approach I tried was to use the logarithm of the weights as the optimization variables, but that only solves the positivity problem.
This might not be the best place to ask, but I found this post closely related.
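
The log-parameterization mentioned above can be sketched as follows. The toy loss, the starting value, and the learning rate are assumptions for illustration; the point is only that `w = exp(theta)` can never go negative, while nothing stops it from exceeding 1.

```python
import torch

# Optimize theta = log(w) instead of w itself; w = exp(theta) is always > 0.
theta = torch.log(torch.tensor([0.5])).requires_grad_()
optimizer = torch.optim.SGD([theta], lr=0.1)

for _ in range(100):
    optimizer.zero_grad()
    w = theta.exp()
    loss = ((w - 2.0) ** 2).sum()   # toy loss with its minimum at w = 2
    loss.backward()
    optimizer.step()

w_final = theta.exp().item()
```

Here the optimizer drives `w` toward 2, which is positive but outside [0, 1]. Swapping `exp` for a sigmoid would bound the value on both sides, though that is a suggestion beyond what was tried in this post.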

Yeah, I mean you can either clip the weights after some number of gradient updates, or you can add the deviation of the weights from your desired range as an extra term in the model loss function. That is sort of a Lagrangian approach: the penalty on the deviation gives you a looser or stricter implicit constraint depending on its weight. It depends on the problem, but the Lagrangian approach is probably better in most cases (it’s basically what you do with regularization/sparsity instead of directly imposing sparsity on weights).
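
A rough sketch of the penalty idea, under stated assumptions: `range_penalty`, the quadratic form of the penalty, and the value of `penalty_weight` are all illustrative choices rather than anything prescribed in this thread.

```python
import torch


def range_penalty(w, low=0.0, high=1.0):
    """Quadratic penalty that is zero inside [low, high] and grows outside it."""
    below = torch.relu(low - w)
    above = torch.relu(w - high)
    return (below ** 2 + above ** 2).sum()


w = torch.tensor([-0.4, 0.5, 1.3], requires_grad=True)
task_loss = (w ** 2).sum()          # stand-in for the real model loss
penalty_weight = 10.0               # assumed value; larger means stricter
loss = task_loss + penalty_weight * range_penalty(w)
loss.backward()
```

Only the out-of-range entries receive an extra gradient pulling them back toward [0, 1]. Unlike clamping, this is a soft constraint, so values can still end up slightly outside the range.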
