Random training runs don't converge

I’m training a basic VGG network on a vision dataset of 48x48 images. I run the same script multiple times to report the average, but some runs randomly don’t converge: the classification accuracy doesn’t increase and the loss doesn’t decrease. This happens randomly (e.g. 1 run out of 5). Where should I start debugging, and what are the possible culprits?

This is pretty common when training deep networks. Several of the underlying processes (e.g. initialization of the network weights, the cuDNN convolution backend, mini-batch ordering, etc.) are randomized, and some of the resulting configurations train better or worse than others.

If you want your experiments to be perfectly repeatable, consider adding something like this to your main script.

import random
import numpy as np
import torch

randomseed = 12345  # Or your favourite integer
random.seed(randomseed)        # If you are using python's random module anywhere
np.random.seed(randomseed)     # If you are using numpy functions anywhere
torch.manual_seed(randomseed)
torch.cuda.manual_seed_all(randomseed)  # If you are using PyTorch on a GPU
# One other source of randomness that most people ignore:
# cuDNN picks its convolution algorithms non-deterministically by default
torch.backends.cudnn.deterministic = True
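
If you shuffle your training data with a DataLoader, the batch order is yet another seeded source of randomness. Here is a minimal sketch for pinning that down as well (train_dataset and the batch size are placeholders for your own setup):

from torch.utils.data import DataLoader

g = torch.Generator()
g.manual_seed(randomseed)  # same integer as above

def seed_worker(worker_id):
    # Give each DataLoader worker process reproducible numpy/python seeds
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True,
                          num_workers=4, worker_init_fn=seed_worker, generator=g)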

I agree. The randomness is intrinsic, and one might prefer deterministic behaviour when the target is production. But when doing research, most people average their results over several runs and compare against whatever modification they made. What about that situation? Could it be, for example, that the learning rate is on the borderline, i.e. it sometimes lets the model converge and sometimes it doesn’t?

That might be a reason, but I would rather suspect the parameter initialization to be producing these failing runs.
Do you use any custom init functions or the default ones?
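One quick way to separate the two hypotheses is to fix the seeds and sweep the learning rate: if the failures disappear at lower learning rates, the step size is the culprit; if they persist, the initialization (or the loss) is more suspect. A rough sketch, assuming your training loop is wrapped in a function train(lr, seed) that returns the final accuracy (both names are placeholders for your own code):

results = {}
for lr in [1e-1, 1e-2, 1e-3]:
    for seed in [0, 1, 2, 3, 4]:
        acc = train(lr=lr, seed=seed)   # your training function
        results[(lr, seed)] = acc
        print(f"lr={lr:g} seed={seed} accuracy={acc:.3f}")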

Yes, I am using the default VGG initializers:

def _initialize_weights(self):
    for m in self.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            if m.bias is not None:
                nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.BatchNorm2d):
            nn.init.constant_(m.weight, 1)
            nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.Linear):
            nn.init.normal_(m.weight, 0, 0.01)
            nn.init.constant_(m.bias, 0)
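
Those defaults look reasonable, but you can sanity-check a fresh initialization directly: push one random batch through the untrained model and look at the initial loss. For a balanced C-class problem it should be close to ln(C); a much larger (or NaN) value on some seeds would point at the initialization or the loss. A sketch, assuming a model constructor VGG(num_classes=...) and 3-channel 48x48 inputs (adjust the names and shapes to your own code and data):

import math
import torch
import torch.nn.functional as F

num_classes = 10                      # placeholder, use your own value
model = VGG(num_classes=num_classes)  # placeholder constructor for your network

x = torch.randn(32, 3, 48, 48)        # adjust channels/shape to your data
y = torch.randint(0, num_classes, (32,))

with torch.no_grad():
    logits = model(x)
    loss = F.cross_entropy(logits, y)

print(f"initial loss: {loss.item():.3f}  (expected ~{math.log(num_classes):.3f})")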

I am using a custom cross-entropy loss which is basically the same as the built-in function. I don’t think it is the reason, though!

import torch
import torch.nn as nn

class Loss(torch.nn.Module):
    def __init__(self):
        super(Loss, self).__init__()
        self.epsilon = 1e-7
        self.log_softmax = nn.LogSoftmax(dim=1)
        self.softmax = nn.Softmax(dim=1)  # currently unused

    def forward(self, prediction, target):
        # Clamp the per-class negative log-probabilities to [epsilon, 1000]
        clamped = torch.clamp(-self.log_softmax(prediction), self.epsilon, 1000)
        # target is expected to be one-hot encoded, so the sum picks out the true class
        loss = torch.mean(torch.sum(target.float() * clamped, dim=1))
        return loss
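
You can verify that claim numerically: with one-hot targets (which is what the forward above assumes) the result should match nn.CrossEntropyLoss on class indices, up to the clamping. A quick check, assuming the Loss class above:

import torch
import torch.nn.functional as F

logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
one_hot = F.one_hot(labels, num_classes=10)

custom = Loss()(logits, one_hot)
builtin = F.cross_entropy(logits, labels)
print(custom.item(), builtin.item())  # should agree closely; the clamp rarely binds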