Struggling to compare PyTorch and plain Python results for a simple architecture

I built the following simple architecture in PyTorch to try to predict cat vs. non-cat from the Coursera exercise.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

# dim (the flattened image size), epochs and train_loader are defined elsewhere
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.input_layer = nn.Linear(dim, 7)
        self.hidden_layer = nn.Linear(7, 1)
        self.output_layer = nn.Linear(1, 2)  # 2 classes

    def forward(self, x):
        x = x.view(-1, dim)  # flatten to [batch_size, dim]
        x = F.relu(self.input_layer(x))
        x = F.relu(self.hidden_layer(x))
        x = self.output_layer(x)
        return F.log_softmax(x, dim=1)
    
net = Net()

# Training
optimizer = optim.SGD(net.parameters(), lr=0.0075)

iterations = 0
losses = np.array([])
for e in range(epochs):

    for data in train_loader:

        X, y = data
        net.zero_grad()
        output = net(X.view(-1, dim))
        loss = F.nll_loss(output, y)
        loss.backward()   # compute gradients
        optimizer.step()  # update weights
        iterations += 1
        losses = np.append(losses, loss.item())
        print(loss)

print('final loss:', loss)
print(losses)

I have the expected results from Coursera, so I used the same lr=0.0075 and the same number of nodes per layer, but my results are really different and not good:

  • Using the same learning rate but SGD instead of (full-batch) gradient descent, the loss is ridiculously noisy, no matter how much I increase the batch size to smooth the curve. The situation only improves when I reduce the lr by one order of magnitude… (see the sketch after this list)
  • The probabilities predict only one of the classes, and accuracy is low.
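For context, this is roughly how I would mimic Coursera's full-batch gradient descent in PyTorch, simply by making one batch contain the whole training set. Just a sketch; train_set below is a placeholder for whatever Dataset feeds my train_loader:

# Sketch: one optimizer step per epoch over the entire training set,
# approximating the plain (batch) gradient descent used in the Coursera code.
# "train_set" is a placeholder for the Dataset behind train_loader.
full_loader = torch.utils.data.DataLoader(train_set, batch_size=len(train_set), shuffle=False)

for e in range(epochs):
    for X, y in full_loader:            # a single batch with all training examples
        net.zero_grad()
        output = net(X.view(-1, dim))
        loss = F.nll_loss(output, y)
        loss.backward()
        optimizer.step()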

I explored the train and test datasets, and they happen to be unbalanced (the train dataset has 65% label 0 and 35% label 1 over 209 examples, while the test set consists of 50 images: 34% label 0 and 64% label 1, i.e. unbalanced in favor of the opposite class from the train set).
This imbalance does not pose a problem for the Coursera example, but it seems my architecture is not handling it in the same way?
Is it possible that SGD is way more sensitive than GD when dealing with unbalanced datasets like these? Or should I focus on trying to find another source of error?
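If the imbalance itself turns out to be the problem, I guess I could also try weighting the loss per class. A rough sketch (the counts are only reconstructed from the 65%/35% split over 209 examples, using the weight argument of F.nll_loss):

import torch
import torch.nn.functional as F

# Rough sketch: weight each class inversely to its frequency in the train set.
# Counts are approximate, reconstructed from the 65%/35% split over 209 examples.
class_counts = torch.tensor([136.0, 73.0])               # [# of label 0, # of label 1]
class_weights = class_counts.sum() / (2 * class_counts)

# inside the training loop, the loss line would then become:
# loss = F.nll_loss(output, y, weight=class_weights)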
Besides, they use random initialization for the weights, multiplied by 0.01, and zero for the bias in each layer… Is it possible that using the default Kaiming initialization for nn.Linear changes the results significantly?
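For reference, this is roughly how I would try to replicate their initialization in PyTorch (normal weights scaled by 0.01 and zero biases, as I understand their code); just a sketch, untested:

import torch.nn as nn

def coursera_like_init(m):
    # Sketch: mirror the course's init as described above --
    # normal weights scaled by 0.01 (i.e. std=0.01) and zero biases.
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)
        nn.init.zeros_(m.bias)

net.apply(coursera_like_init)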

P.S.:

  • I am using a DataLoader with shuffle=True, to avoid loading batches that accidentally consist of only label 0 or only label 1.
  • Since the DataLoader accepts the format [bs, C, H, W] and the original dataset is [n_examples, H, W, C], I do a swap operation on the initial numpy 4D array as follows, before building the tensors:
Images_dataset = np.swapaxes(images_dataset, 1, 3)

The swapaxes operation will create an input of [N, C, W, H], while [N, C, H, W] is expected.
I would recommend using x = x.permute(0, 3, 1, 2) on the tensor instead.
This shouldn’t make a huge difference if you are flattening the tensor anyway.
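Something like this, with an arbitrary non-square dummy batch just to show the resulting shapes:

import torch

# Arbitrary dummy batch in the dataset's layout [N, H, W, C] (non-square on purpose)
images = torch.randn(4, 32, 64, 3)

nchw = images.permute(0, 3, 1, 2)   # -> [4, 3, 32, 64], the expected [N, C, H, W]
ncwh = images.transpose(1, 3)       # equivalent to np.swapaxes(..., 1, 3) -> [4, 3, 64, 32]

print(nchw.shape, ncwh.shape)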

Yes, that could make a huge difference.
Could you apply the same initialization as is used in the Coursera code?

I also assume that the model architecture is the same as the one from the course?
The single output feature of the hidden layer looks a bit “odd”, but might still work.

If the architecture has good prediction power, shouldn’t it also be good at predicting 90-degree rotated images? I mean, if I made the mistake of feeding WH instead of HW for the whole train set and test set, shouldn’t it be able to extract the features anyway? Or is this “rotation” any different in 4D? I would love to understand this conceptually.
I will then do the operation in PyTorch using permute and change the default initialization to see if something improves.
Yes, they oddly use only one node at the end. (They have another, deeper architecture with 4 layers, but I face the same issues with that one.)
Last thing: a sigmoid activation on the last layer plus cross-entropy loss is equivalent to NLL plus softmax, am I right?
Thanks!!

I think it shouldn’t matter if you are retraining the model and also flattening the tensor for this particular architecture.

No, nn.CrossEntropyLoss expects raw logits (no activation at the end), while nn.NLLLoss expects log probabilities (F.log_softmax at the end).
Since you are using the second approach, it should be fine.
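A small example with random values to illustrate the equivalence:

import torch
import torch.nn.functional as F

logits = torch.randn(8, 2)                   # raw model outputs for 8 samples, 2 classes
targets = torch.randint(0, 2, (8,))

loss_ce = F.cross_entropy(logits, targets)                     # expects raw logits
loss_nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)   # expects log-probabilities

print(torch.allclose(loss_ce, loss_nll))     # True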
