Why is PyTorch Maximizing the Loss?

I have a simple training loop that looks like this.

optim = torch.optim.Adam(verify_net.parameters())
# The training is designed to continue until timeout or until the goal is reached
while True:
    optim.zero_grad()
    # Forward pass
    outputs = verify_net(inputs)

    # Compute how much bigger every other label can be
    # than the true label in the worst case
    losses = torch.stack(
        [upper_bound(outputs[i] - outputs[true_label])
         for i in range(len(outputs)) if i != true_label])

    if (losses < 0).all():
        return True

    loss = torch.sum(losses)
    print(loss)
    loss.backward()
    optim.step()

The optimization goal is to make outputs[true_label] greater than all other outputs in the worst case, which is computed by the upper_bound function:

upper_bound = lambda x: x[0] + torch.sum(torch.abs(x[1:]))
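
To illustrate what upper_bound computes, here is a tiny standalone example (the numbers are just made up; in my reading x[0] is the nominal value of the difference and x[1:] are deviation terms):

import torch

upper_bound = lambda x: x[0] + torch.sum(torch.abs(x[1:]))

# The worst case adds the absolute value of every deviation term to the nominal value.
x = torch.tensor([-2.0, 1.5, -0.5])
print(upper_bound(x))  # tensor(0.) -> not verified yet, since it is not < 0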

If after some step every upper bound losses[i] < 0, the loop may break and we are happy. Until then, the optimizer should keep working towards that goal. This is the output I get by printing the loss after each step:

tensor(632.8606, grad_fn=<SumBackward0>)
tensor(668.1876, grad_fn=<SumBackward0>)
tensor(698.2267, grad_fn=<SumBackward0>)
tensor(733.7394, grad_fn=<SumBackward0>)
tensor(764.9390, grad_fn=<SumBackward0>)
tensor(799.8583, grad_fn=<SumBackward0>)
tensor(834.7762, grad_fn=<SumBackward0>)
tensor(861.3510, grad_fn=<SumBackward0>)
tensor(895.9908, grad_fn=<SumBackward0>)
tensor(930.6293, grad_fn=<SumBackward0>)
tensor(965.2657, grad_fn=<SumBackward0>)
tensor(1000.8009, grad_fn=<SumBackward0>)

It seems like Adam is doing a fantastic job at maximizing the loss, but this really isn't the behavior I'd expect or want. My first attempt was to change loss into -loss, but I had no luck: Adam then started minimizing the negated loss instead :frowning:

Any suggestions would be much appreciated. Thanks

Hi

Have you tried reducing the learning rate, or switching to plain SGD?
High learning rates can sometimes cause the training to diverge.
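
For example (just a sketch, keeping the rest of your loop unchanged):

# Adam with an explicitly lower learning rate
optim = torch.optim.Adam(verify_net.parameters(), lr=1e-4)

# or plain SGD for comparison
optim = torch.optim.SGD(verify_net.parameters(), lr=1e-3)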

Even with lr=0.000001 I get the same behaviour for both Adam and SGD.

Could you try to simplify the problem and see if you still get the same behavior? For example (see the sketch below):

  • Change the model to something very basic (like a single Linear layer)
  • Change the loss to something simpler (remove the absolute values)
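
Something along these lines, as a self-contained sketch (the layer sizes, input, and label below are placeholders I made up, not your actual setup):

import torch

torch.manual_seed(0)

# Hypothetical tiny setup just to isolate the optimizer behaviour
model = torch.nn.Linear(4, 3)
inputs = torch.randn(4)
true_label = 0

optim = torch.optim.SGD(model.parameters(), lr=1e-2)

for step in range(20):
    optim.zero_grad()
    outputs = model(inputs)
    # Simplified loss: plain differences, no absolute values
    losses = torch.stack([outputs[i] - outputs[true_label]
                          for i in range(len(outputs)) if i != true_label])
    loss = torch.sum(losses)
    print(step, loss.item())
    loss.backward()
    optim.step()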