Very small learning rate needed for convergence

I am training a custom patch-based network (4 layers) and I realized that it only starts converging if I set the lr to 0.0000001.
I feel like something is wrong with such a setting.
However, if I run the training with lr = 0.01, the loss gets huge (several billions) and eventually becomes NaN.

Can this be related to the initialisation?

It could be related to the weight init or other hyper-parameters.
How do you initialize your model?
Which optimizer are you using? Did you change the momentum (if available)?

Weight initialisation is done through Xavier’s approach:

- m.weight.data.normal_(0, math.sqrt(2. / n)) for each conv module m in the network
- m.weight.data.normal_(0, 0.01) for the fc layer on top of the net.

As for momentum, I used the commonly used value of 0.9, although I do not know whether it suits my learning rate, since that one is really low.
I have to admit I am not really familiar with the relation between the learning rate and momentum, since they seem to impact the weight update in different ways.

I was looking for more documentation about these two hyper-parameters when I came across weight decay. I understood that it aims at limiting the growth of the weights’ magnitude. Is that correct?

I am not sure how to tune momentum (and weight decay?) so as to get a decent learning rate.

Ok, thanks for the update!
Could you post the code calculating the loss, the optimizer and if possible the model?
Maybe your gradients are really high somehow.
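
If you want to check that, you could print the gradient norms right after loss.backward(). A quick sketch (assuming your model instance is called model, which is just a placeholder name here):

    # print the L2 norm of each parameter's gradient after loss.backward()
    total_norm = 0.0
    for name, param in model.named_parameters():
        if param.grad is not None:
            grad_norm = param.grad.norm(2).item()
            total_norm += grad_norm ** 2
            print(name, grad_norm)
    print('total gradient norm:', total_norm ** 0.5)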

Usually a momentum of 0.9 should work fine.

Yes, weight decay penalizes the weight magnitude, forcing it to get lower values.
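
To give a rough picture of how these hyper-parameters interact, here is a toy sketch of a single SGD step with momentum and weight decay (a simplification, not the exact PyTorch implementation):

    import torch

    # toy example of one SGD step with momentum and weight decay
    lr, momentum, weight_decay = 0.01, 0.9, 0.0005
    w = torch.tensor(1.0)   # a single parameter
    g = torch.tensor(0.5)   # its gradient from backward()
    v = torch.tensor(0.0)   # momentum buffer

    g = g + weight_decay * w   # weight decay adds an L2 penalty to the gradient
    v = momentum * v + g       # momentum accumulates past gradients
    w = w - lr * v             # the learning rate scales the final step
    print(w)                   # tensor(0.9950)

So the learning rate scales the whole step, while momentum mostly smooths and amplifies it over consecutive iterations.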

Thank you for such quick and helpful answers, really nice community !

Model (just the initialisation) --> comes from torchvision:

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
                if m.bias is not None:
                    m.bias.data.zero_()
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()
            elif isinstance(m, nn.Linear):
                m.weight.data.normal_(0, 0.01)
                m.bias.data.zero_()

Optimizer:

    optimizer = optim.SGD(net.parameters(),
                          lr=0.00000001,
                          momentum=0.9,
                          weight_decay=0.0005)

Loss computation:

    def runEpoch(self):
        loss_list = np.zeros(100000)
        for it, batch in enumerate(tqdm(self.data_loader)):
            data = Variable(batch['image'])
            target = Variable(batch['class_code'])
            # forward
            if self.mode == 'cuda':
                data = data.cuda()
                target = target.long().cuda()
            output = self.model.forward(data)
            loss = self.criterion(output.float(), target)
            loss.backward()
            self.optimizer.step()
            loss_list[it] = loss.item()
        self.avg_loss = np.mean(loss_list[np.nonzero(loss_list)])

Also, note that I am using 4-band 8-bit encoded images as input.

From skimming your code, it looks like you are not zeroing out the gradients after the weight update.
In this case the gradients get accumulated and the weight updates will be quite useless.

Add this line into your for loop and run it again:

    self.optimizer.zero_grad()

It is also recommended to call the model directly to compute the forward pass instead of model.forward().
If you use model.forward() the hooks won’t be called, which might be unimportant in your current code, but might lead to errors in the future. :wink:
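
Putting both suggestions together, a sketch of your loop could look like this (keeping your names; only the two commented lines change):

    def runEpoch(self):
        loss_list = np.zeros(100000)
        for it, batch in enumerate(tqdm(self.data_loader)):
            data = Variable(batch['image'])
            target = Variable(batch['class_code'])
            if self.mode == 'cuda':
                data = data.cuda()
                target = target.long().cuda()
            self.optimizer.zero_grad()    # clear the gradients from the last iteration
            output = self.model(data)     # call the model directly instead of .forward()
            loss = self.criterion(output.float(), target)
            loss.backward()
            self.optimizer.step()
            loss_list[it] = loss.item()
        self.avg_loss = np.mean(loss_list[np.nonzero(loss_list)])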

Also, you could update to the latest stable release (0.4.0). You can find the install instructions on the website.

Thank you very much, I just tried it and it improves the performance by almost 8%!
I am also able to use a much more reasonable learning rate (0.001, I will try learning rate decay as well)!
Why is it so important to zero the gradients? Is it because it would otherwise compute the new gradients and add them to those from the previous iteration (sorry… not sure of the meaning of accumulating here)?

Can you imagine a situation where you would not do such a thing? From what I have seen, it could be implicit when using optimizer.step().

I am already using 0.4.0 release :smiley:
Thank you again !

Yes, your explanation is right. The gradients from each backward pass would be summed to the previous gradients.
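
A tiny example to see this behaviour:

    import torch

    w = torch.ones(1, requires_grad=True)

    (2 * w).sum().backward()
    print(w.grad)  # tensor([2.])

    (2 * w).sum().backward()
    print(w.grad)  # tensor([4.]) -> the second backward adds to the existing gradient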

You could use it to artificially increase your batch size.
If you don’t have enough GPU memory, but need a larger batch size, you could sum the gradients for several smaller batches, average them, and finally perform the optimization step.
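
A minimal sketch of that idea, reusing the names from your code (accumulation_steps is just a hypothetical setting):

    accumulation_steps = 4  # effective batch size = 4 * the loader's batch size

    self.optimizer.zero_grad()
    for it, batch in enumerate(self.data_loader):
        data = batch['image'].cuda()
        target = batch['class_code'].long().cuda()
        loss = self.criterion(self.model(data), target)
        # divide so the summed gradients match the average over the large batch
        (loss / accumulation_steps).backward()
        if (it + 1) % accumulation_steps == 0:
            self.optimizer.step()
            self.optimizer.zero_grad()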

Ah ok, then you don’t need the Variable wrapper anymore, since they were merged with tensors. :wink:
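
E.g. in your loop you could just use the tensors from the DataLoader directly, something like:

    # no Variable wrapper needed in 0.4.0, tensors carry the autograd information themselves
    data = batch['image']
    target = batch['class_code'].long()
    if self.mode == 'cuda':
        data, target = data.cuda(), target.cuda()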

I am using the Adamax optimizer with a learning rate of 0.001 and momentum of 0.9. It works fine for a small number of classes (1-50 classes), but for a large number of classes (200-300) I am getting low accuracy and high loss. What should I do in this case?
Thanks in advance