I built the following simple architecture in PyTorch to try to predict cat vs. non-cat from the Coursera dataset.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.input_layer = nn.Linear(dim, 7)
        self.hidden_layer = nn.Linear(7, 1)
        self.output_layer = nn.Linear(1, 2)  # 2 classes

    def forward(self, x):
        x = x.view(1, dim)
        x = F.relu(self.input_layer(x))
        x = F.relu(self.hidden_layer(x))
        x = self.output_layer(x)
        return F.log_softmax(x, dim=1)

net = Net()
# Training
optimizer = optim.SGD(net.parameters(), lr=0.0075)
iterations = 0
losses = np.array([])
for e in range(epochs):
    for data in train_loader:
        X, y = data
        net.zero_grad()
        output = net(X.view(1, dim))
        loss = F.nll_loss(output, y)
        loss.backward()   # compute gradients
        optimizer.step()  # update weights
        iterations += 1
        losses = np.append(losses, loss.item())
        print(loss)
print('final loss:', loss)
print(losses)
I have the expected results from Coursera, so I used the same lr=0.0075 and the same number of nodes per layer, but my results are really different and not good:

- Using the same learning rate but SGD instead of full-batch gradient descent, the loss is ridiculously noisy, no matter how much I increase the batch size to smooth the curve. The situation only improves when I reduce the lr by an order of magnitude…
- The model predicts only one of the classes, and accuracy is low.
I explored the train and test datasets, and they happen to be unbalanced: the train set has 65% label 0 and 35% label 1 over 209 examples, while the test set consists of 50 images, 34% label 0 and 64% label 1 (unbalanced in favor of the opposite class from the train set).
This imbalance does not pose a problem for the Coursera example, but it seems my architecture is not handling it in the same way.
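In case it helps anyone, one common way to compensate for such an imbalance is to pass per-class weights to F.nll_loss; a minimal sketch with made-up tensors (not my actual data), weighting each class inversely to its train-set frequency:

```python
import torch
import torch.nn.functional as F

# hypothetical log-probabilities for a batch of 4 examples and 2 classes
log_probs = F.log_softmax(torch.randn(4, 2), dim=1)
targets = torch.tensor([0, 0, 0, 1])  # imbalanced toward class 0

# weight each class inversely to its train-set frequency (65% / 35%)
class_weights = torch.tensor([1 / 0.65, 1 / 0.35])
loss = F.nll_loss(log_probs, targets, weight=class_weights)
```

This makes mistakes on the rarer class cost proportionally more, so the model is less tempted to always predict the majority class.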
Is it possible that SGD is much more sensitive than GD to these unbalanced datasets? Or should I look for another source of error?
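To check the SGD-vs-GD hypothesis directly, one can turn each optimizer step into a full-batch gradient-descent step by setting the DataLoader's batch_size to the size of the whole dataset; a sketch with dummy tensors standing in for the 209 examples (12288 = 64 * 64 * 3 flattened pixels is an assumption about dim):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# dummy data standing in for the 209 Coursera train examples
X = torch.randn(209, 12288)  # 12288 = 64 * 64 * 3 flattened pixels
y = torch.randint(0, 2, (209,))
dataset = TensorDataset(X, y)

# batch_size=len(dataset): one batch per epoch, so each optimizer.step()
# becomes a full-batch GD update, as in the Coursera notebook
loader = DataLoader(dataset, batch_size=len(dataset), shuffle=True)
for Xb, yb in loader:
    pass  # a single iteration per epoch
```

If the loss curve becomes smooth under this setting with lr=0.0075, the noise really does come from the stochasticity of the updates rather than from some other bug.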
Besides, they use random initialization for the weights, multiplying by 0.01, and zero for the biases in each layer… Is it possible that using the default Kaiming initialization for nn.Linear changes the results significantly?
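To rule the initialization out as the culprit, the Coursera scheme (weights drawn from a normal distribution scaled by 0.01, biases zero) can be replicated with Module.apply; a sketch assuming the same layer sizes as above, with a placeholder input dim of 12288:

```python
import torch
import torch.nn as nn

def coursera_init(module):
    # mimic W = np.random.randn(...) * 0.01 and b = 0 from the Coursera notebook
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=0.01)
        nn.init.zeros_(module.bias)

net = nn.Sequential(nn.Linear(12288, 7), nn.Linear(7, 1), nn.Linear(1, 2))
net.apply(coursera_init)  # applies the function recursively to every submodule
```

With networks this small, the difference between Kaiming and such tiny near-zero weights can plausibly change early training dynamics, so it is worth testing both.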
P.S.:
- I am using a DataLoader with shuffle=True, to avoid batches that accidentally consist of only label 0 or only label 1.
- Since the DataLoader expects the format [bs, C, H, W] and the original dataset is [n_examples, H, W, C], I do a swap operation on the initial NumPy 4-D array as follows, before building the tensors:
images_dataset = np.swapaxes(images_dataset, 1, 3)
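One caveat on that swap: for the square 64×64 cat images it happens to produce the right shape, but np.swapaxes(a, 1, 3) exchanges H with C and leaves W in place, so the result is actually [n, C, W, H] with H and W transposed. np.transpose with an explicit axis order is unambiguous; a sketch with a dummy non-square array to make the difference visible:

```python
import numpy as np

images = np.zeros((5, 64, 32, 3))  # dummy [n_examples, H, W, C] with H != W

swapped = np.swapaxes(images, 1, 3)              # [n, C, W, H]
transposed = np.transpose(images, (0, 3, 1, 2))  # [n, C, H, W]

print(swapped.shape)     # (5, 3, 32, 64)
print(transposed.shape)  # (5, 3, 64, 32)
```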