Inconsistent decreases in training loss across training runs

I have a computer vision problem, but it’s probably relevant to other deep learning models.
Sometimes my model immediately gets stuck at a high training loss and can't improve from there. If I re-train the model several times, eventually one of the training runs results in learning. I'm assuming this has to do with bad weight initialization, but I was wondering what I should try to address this.

I have a small dataset (only ~100 positive examples). The problem is binary classification.

For reference here is the model:

import torch.nn as nn
import torch.nn.functional as F


class CNN(nn.Module):
    def __init__(self, input_dim):
        # nn.Module must be initialized before any submodules are assigned
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=32, kernel_size=3, padding=1)
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)

        self.conv3 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1)
        self.conv4 = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, padding=1)
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)

        # Input to classifier is input_dim / 4 because 2 max pool layers with kernel_size of 2
        final_dim = input_dim // 4
        self.classifier = nn.Sequential(
            nn.Linear(in_features=64 * final_dim * final_dim, out_features=128),
            nn.Linear(in_features=128, out_features=2),
        )

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = self.pool1(x)
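        # F.dropout defaults to training=True, so tie it to the module's mode
        x = F.dropout(x, p=.25, training=self.training)

        x = self.conv3(x)
        x = F.relu(x)
        x = self.conv4(x)
        x = F.relu(x)
        x = self.pool2(x)
        x = F.dropout(x, p=.25, training=self.training)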

        x = x.reshape(x.shape[0], -1)
        x = self.classifier(x)
        x = F.softmax(x, dim=-1)
        return x


  • NLLLoss
  • Adam optimizer
  • lr 1e-3

The data had range [0, 1] and is normalized with a mean of 0.5 and a std of 0.5.


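nn.NLLLoss expects log-probabilities, so you should use F.log_softmax as the last activation instead of F.softmax. With plain softmax outputs, the loss becomes the negative mean probability of the target class rather than the negative log-likelihood. Could you change it and check whether your training performs better?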
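
To make the mismatch concrete, here is a minimal sketch comparing the two activations on the same inputs. The logits and targets are made-up values for illustration only, not taken from your model or data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical raw classifier outputs (logits) for 4 samples and 2 classes.
logits = torch.randn(4, 2)
targets = torch.tensor([0, 1, 1, 0])

criterion = nn.NLLLoss()

# Mismatched: NLLLoss receives probabilities, so it returns the negative
# mean probability of the target class, a value in [-1, 0].
loss_softmax = criterion(F.softmax(logits, dim=-1), targets)

# Matched: NLLLoss receives log-probabilities and returns the usual
# negative log-likelihood, which is non-negative.
loss_log_softmax = criterion(F.log_softmax(logits, dim=-1), targets)

# Equivalent alternative: CrossEntropyLoss applies log_softmax internally,
# so it takes the raw logits directly.
loss_cross_entropy = nn.CrossEntropyLoss()(logits, targets)

print(f"softmax + NLLLoss:      {loss_softmax.item():.4f}")
print(f"log_softmax + NLLLoss:  {loss_log_softmax.item():.4f}")
print(f"CrossEntropyLoss:       {loss_cross_entropy.item():.4f}")
```

The last two values should match. So in your model you could either swap F.softmax for F.log_softmax in forward, or return the raw output of self.classifier and switch the criterion to nn.CrossEntropyLoss.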