PyTorch ConvNet randomly learns a single label and gets stuck

I am trying to classify two types of images (i.e. 2 labels). I have over 30k images for my training dataset, and have built the CNN that’s described below.

Here’s the weird part - I’ve trained this network over 20 times, and I get two completely different behaviours:

  1. The network will in fact “learn”: the loss decreases, and after about 10 epochs I correctly classify about 80% of my test dataset.

  2. After seeing ~1k pictures (not even a single epoch), the network will “decide” to always classify the images as only one of the labels! The loss gets stuck and nothing happens.

The strange part - it's the exact same piece of code, nothing changes between runs. I've spent many hours debugging and got to a point where I don't have a clue what's going on.

The only thing I noticed is that when the network fails to learn, the initial output of the network (without performing even one backprop iteration) is something like -

model = MyNetwork()
images, labels = next(iter(train_dataset_loader))
fw_pass_output = model(images)
tensor([[9.4696e-09, 1.0000e+00],
        [2.8105e-08, 1.0000e+00],
        [7.4285e-09, 1.0000e+00],
        [4.3701e-09, 1.0000e+00],
        [4.4942e-08, 1.0000e+00]], grad_fn=<SliceBackward>)

And on the other hand when the network does succeed in learning, it’ll look like this -

tensor([[0.4982, 0.5018],
        [0.4353, 0.5647],
        [0.3051, 0.6949],
        [0.4823, 0.5177],
        [0.4342, 0.5658]], grad_fn=<SliceBackward>)

So as you can see - when the network manages to learn, the initial weight initialisation seems to produce more balanced outputs, while on other occasions it arbitrarily assigns everything to one of the classes, and from there it never manages to learn anything.

I tried swapping between Adam/SGD, lots of different learning rates, and different weight decay values - nothing helped…

What am I missing? What’s causing this behaviour? Is it a bug in my code, or a concept that I’m missing?

class MyNetwork(nn.Module):
    def __init__(self):
        super(MyNetwork, self).__init__()

        self.conv1_layer = nn.Conv2d(3, 32, 5) 
        self.conv2_layer = nn.Conv2d(32, 16, 3) 
        self.conv3_layer = nn.Conv2d(16, 8, 2) 

        self.layer_size_after_convs = 8 * 5 * 5 
        self.fc1 = nn.Linear(self.layer_size_after_convs, TOTAL_NUM_OF_CLASSES)

    def forward(self, x):
        """Perform a forward pass on the network"""
        x = F.relu(self.conv1_layer(x))
        x = F.max_pool2d(x, (3, 3))

        x = F.relu(self.conv2_layer(x))
        x = F.max_pool2d(x, (2, 2))

        x = F.relu(self.conv3_layer(x))
        x = F.max_pool2d(x, (2, 2))
        x = x.view(-1, self.layer_size_after_convs)
        x = self.fc1(x)
        x = F.softmax(x, dim=1)
        return x

model = MyNetwork()
loss_func = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, weight_decay=0.1)

total_steps = len(train_dataset_loader)

epochs = 100
for epoch_num in range(epochs):
    for i, (img_batch, labels) in enumerate(train_dataset_loader):
        fw_pass_output = model(img_batch)
        loss_values = loss_func(fw_pass_output, labels)

        optimizer.zero_grad()
        loss_values.backward()
        optimizer.step()

        print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch_num+1, epochs, i+1, total_steps, loss_values.item()))


Can you add an additional layer to your fully connected part? Going from 200 -> 2 seems like a big drop. Maybe go from 200 -> 25 (~1/8) -> 2 (~1/12), and see if that works for you.
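A minimal sketch of that suggestion, reusing the layer names from the question (the class name, the 25-unit hidden size, and the fact that it returns raw logits rather than softmax outputs are my choices here - `nn.CrossEntropyLoss` applies log-softmax internally):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MyNetworkDeeperFC(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.conv1_layer = nn.Conv2d(3, 32, 5)
        self.conv2_layer = nn.Conv2d(32, 16, 3)
        self.conv3_layer = nn.Conv2d(16, 8, 2)
        self.layer_size_after_convs = 8 * 5 * 5  # 200, as in the question

        # Instead of 200 -> 2 directly, go 200 -> 25 -> 2
        self.fc1 = nn.Linear(self.layer_size_after_convs, 25)
        self.fc2 = nn.Linear(25, num_classes)

    def forward(self, x):
        x = F.relu(self.conv1_layer(x))
        x = F.max_pool2d(x, (3, 3))
        x = F.relu(self.conv2_layer(x))
        x = F.max_pool2d(x, (2, 2))
        x = F.relu(self.conv3_layer(x))
        x = F.max_pool2d(x, (2, 2))
        x = x.view(-1, self.layer_size_after_convs)
        x = F.relu(self.fc1(x))
        return self.fc2(x)  # raw logits; CrossEntropyLoss handles the softmax
```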

Edit 1: did you use a seed in order to replicate your results?
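For reproducibility, a typical seeding setup looks something like this (the helper name is mine; which calls you actually need depends on where your randomness comes from):

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 0):
    # Seed every RNG a typical PyTorch training run touches
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # CPU RNG
    torch.cuda.manual_seed_all(seed)  # all GPU RNGs, no-op without CUDA

set_seed(0)
```

With the same seed, weight initialisation is identical across runs, so you can tell whether the "stuck" behaviour is tied to a particular initialisation.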
Edit 2: I also suggest using a dropout layer somewhere between your convolution layers in order to regularize.