SGD gives good results, Adam gives really bad ones. Why?

This is really weird.
I am using solution code for CIFAR10 with a CNN (from the public Udacity deep learning repository).

When I train the network with the SGD optimizer, I get 72% accuracy (let's assume that's good).
But when I switch to Adam, I get 10% accuracy, and during training the loss doesn't change.

I have checked multiple times that it's the correct code, and it is!

Do you know the reason? As far as I know, Adam should be better than SGD.

This is the loss for Adam:
Epoch: 1 Training Loss: 2.305185 Validation Loss: 2.303153
Epoch: 2 Training Loss: 2.304046 Validation Loss: 2.303932
Epoch: 3 Training Loss: 2.304276 Validation Loss: 2.304587
Epoch: 4 Training Loss: 2.304290 Validation Loss: 2.304771
Epoch: 5 Training Loss: 2.304313 Validation Loss: 2.304767
Epoch: 6 Training Loss: 2.304428 Validation Loss: 2.305437
Epoch: 7 Training Loss: 2.304531 Validation Loss: 2.303925
Epoch: 8 Training Loss: 2.304736 Validation Loss: 2.304267

And here is the loss for SGD:
Epoch: 1 Training Loss: 1.484200 Validation Loss: 0.292755
Epoch: 2 Training Loss: 1.128465 Validation Loss: 0.250197
Epoch: 3 Training Loss: 0.975305 Validation Loss: 0.219079
Epoch: 4 Training Loss: 0.866838 Validation Loss: 0.219783
Epoch: 5 Training Loss: 0.779665 Validation Loss: 0.180415
Epoch: 6 Training Loss: 0.714491 Validation Loss: 0.168658
Epoch: 7 Training Loss: 0.659699 Validation Loss: 0.160974
Epoch: 8 Training Loss: 0.613168 Validation Loss: 0.158714

Here is a link to the Colab notebook if you want:

https://colab.research.google.com/drive/1gwZChd4C4b7IXTr0HdSscmglqcMlosTW?usp=sharing

and here is the entire code:

import torch
import numpy as np

# check if CUDA is available
train_on_gpu = torch.cuda.is_available()

if not train_on_gpu:
    print('CUDA is not available.  Training on CPU ...')
else:
    print('CUDA is available!  Training on GPU ...')

from torchvision import datasets
import torchvision.transforms as transforms
from torch.utils.data.sampler import SubsetRandomSampler

# number of subprocesses to use for data loading
num_workers = 0
# how many samples per batch to load
batch_size = 20
# percentage of training set to use as validation
valid_size = 0.2

# convert data to a normalized torch.FloatTensor
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

# choose the training and test datasets
train_data = datasets.CIFAR10('data', train=True,
                              download=True, transform=transform)
test_data = datasets.CIFAR10('data', train=False,
                             download=True, transform=transform)

# obtain training indices that will be used for validation
num_train = len(train_data)
indices = list(range(num_train))
split = int(np.floor(valid_size * num_train))
train_idx, valid_idx = indices[split:], indices[:split]

# define samplers for obtaining training and validation batches
train_sampler = SubsetRandomSampler(train_idx)
valid_sampler = SubsetRandomSampler(valid_idx)

# prepare data loaders (combine dataset and sampler)
train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size,
    sampler=train_sampler, num_workers=num_workers)
valid_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size,
    sampler=valid_sampler, num_workers=num_workers)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=batch_size,
    num_workers=num_workers)

# specify the image classes
classes = ['airplane', 'automobile', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck']

“”"### Visualize a Batch of Training Data"""

Commented out IPython magic to ensure Python compatibility.

import matplotlib.pyplot as plt

%matplotlib inline

# helper function to un-normalize and display an image
def imshow(img):
    img = img / 2 + 0.5  # unnormalize
    plt.imshow(np.transpose(img, (1, 2, 0)))  # convert from Tensor image

# obtain one batch of training images
dataiter = iter(train_loader)
images, labels = next(dataiter)
images = images.numpy()  # convert images to numpy for display

# plot the images in the batch, along with the corresponding labels
fig = plt.figure(figsize=(25, 4))
# display 20 images
for idx in np.arange(20):
    ax = fig.add_subplot(2, 20//2, idx+1, xticks=[], yticks=[])
    imshow(images[idx])
    ax.set_title(classes[labels[idx]])

“”"### View an Image in More Detail

Here, we look at the normalized red, green, and blue (RGB) color channels as three separate, grayscale intensity images.

rgb_img = np.squeeze(images[3])
channels = ['red channel', 'green channel', 'blue channel']

fig = plt.figure(figsize=(36, 36))
for idx in np.arange(rgb_img.shape[0]):
    ax = fig.add_subplot(1, 3, idx + 1)
    img = rgb_img[idx]
    ax.imshow(img, cmap='gray')
    ax.set_title(channels[idx])
    width, height = img.shape
    thresh = img.max()/2.5
    for x in range(width):
        for y in range(height):
            val = round(img[x][y], 2) if img[x][y] != 0 else 0
            ax.annotate(str(val), xy=(y, x),
                        verticalalignment='center', size=8,
                        color='white' if img[x][y] < thresh else 'black')

import torch.nn as nn
import torch.nn.functional as F

# define the CNN architecture
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # convolutional layer (sees 32x32x3 image tensor)
        self.conv1 = nn.Conv2d(3, 16, 3, padding=1)
        # convolutional layer (sees 16x16x16 tensor)
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        # convolutional layer (sees 8x8x32 tensor)
        self.conv3 = nn.Conv2d(32, 64, 3, padding=1)
        # max pooling layer
        self.pool = nn.MaxPool2d(2, 2)
        # linear layer (64 * 4 * 4 -> 500)
        self.fc1 = nn.Linear(64 * 4 * 4, 500)
        # linear layer (500 -> 10)
        self.fc2 = nn.Linear(500, 10)
        # dropout layer (p=0.25)
        self.dropout = nn.Dropout(0.25)

    def forward(self, x):
        # add sequence of convolutional and max pooling layers
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        # flatten image input
        x = x.view(-1, 64 * 4 * 4)
        # add dropout layer
        x = self.dropout(x)
        # add 1st hidden layer, with relu activation function
        x = F.relu(self.fc1(x))
        # add dropout layer
        x = self.dropout(x)
        # add 2nd layer (output layer, no activation; CrossEntropyLoss expects raw logits)
        x = self.fc2(x)
        return x

# create a complete CNN
model = Net()

# move model to GPU if CUDA is available
if train_on_gpu:
    model.cuda()

import torch.optim as optim

# specify loss function (categorical cross-entropy)

criterion = nn.CrossEntropyLoss()

# specify optimizer

optimizer = optim.Adam(model.parameters(), lr=0.01)

# number of epochs to train the model
n_epochs = 30

valid_loss_min = np.inf  # track change in validation loss

for epoch in range(1, n_epochs+1):

    # keep track of training and validation loss
    train_loss = 0.0
    valid_loss = 0.0

    # train the model #
    model.train()
    for data, target in train_loader:
        # move tensors to GPU if CUDA is available
        if train_on_gpu:
            data, target = data.cuda(), target.cuda()
        # clear the gradients of all optimized variables
        optimizer.zero_grad()
        # forward pass: compute predicted outputs by passing inputs to the model
        output = model(data)
        # calculate the batch loss
        loss = criterion(output, target)
        # backward pass: compute gradient of the loss with respect to model parameters
        loss.backward()
        # perform a single optimization step (parameter update)
        optimizer.step()
        # update training loss
        train_loss += loss.item()*data.size(0)

    # validate the model #
    model.eval()
    for data, target in valid_loader:
        # move tensors to GPU if CUDA is available
        if train_on_gpu:
            data, target = data.cuda(), target.cuda()
        # forward pass: compute predicted outputs by passing inputs to the model
        output = model(data)
        # calculate the batch loss
        loss = criterion(output, target)
        # update average validation loss
        valid_loss += loss.item()*data.size(0)

    # calculate average losses
    train_loss = train_loss/len(train_loader.sampler)
    valid_loss = valid_loss/len(valid_loader.sampler)
    # print training/validation statistics
    print('Epoch: {} \tTraining Loss: {:.6f} \tValidation Loss: {:.6f}'.format(
        epoch, train_loss, valid_loss))

    # save model if validation loss has decreased
    if valid_loss <= valid_loss_min:
        print('Validation loss decreased ({:.6f} --> {:.6f}).  Saving model ...'.format(
            valid_loss_min, valid_loss))
        torch.save(model.state_dict(), 'model_cifar.pt')
        valid_loss_min = valid_loss

“”"### Load the Model with the Lowest Validation Loss"""



"""### Test the Trained Network

Test your trained model on previously unseen data! A "good" result will be a CNN that gets around 70% (or more, try your best!) accuracy on these test images.
"""

# track test loss
test_loss = 0.0
class_correct = list(0. for i in range(10))
class_total = list(0. for i in range(10))

model.eval()
# iterate over test data
for data, target in test_loader:
    # move tensors to GPU if CUDA is available
    if train_on_gpu:
        data, target = data.cuda(), target.cuda()
    # forward pass: compute predicted outputs by passing inputs to the model
    output = model(data)
    # calculate the batch loss
    loss = criterion(output, target)
    # update test loss
    test_loss += loss.item()*data.size(0)
    # convert output probabilities to predicted class
    _, pred = torch.max(output, 1)
    # compare predictions to true label
    correct_tensor = pred.eq(target.data.view_as(pred))
    correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
    # calculate test accuracy for each object class
    for i in range(batch_size):
        label = target.data[i]
        class_correct[label] += correct[i].item()
        class_total[label] += 1

# average test loss
test_loss = test_loss/len(test_loader.dataset)
print('Test Loss: {:.6f}\n'.format(test_loss))

for i in range(10):
    if class_total[i] > 0:
        print('Test Accuracy of %5s: %2d%% (%2d/%2d)' % (
            classes[i], 100 * class_correct[i] / class_total[i],
            np.sum(class_correct[i]), np.sum(class_total[i])))
    else:
        print('Test Accuracy of %5s: N/A (no training examples)' % (classes[i]))

print('\nTest Accuracy (Overall): %2d%% (%2d/%2d)' % (
    100. * np.sum(class_correct) / np.sum(class_total),
    np.sum(class_correct), np.sum(class_total)))


About the question:
the second loss above is from SGD (I accidentally wrote Adam at first).

You should use a much lower lr for Adam to work.
Try lr = 1e-4 or lower.
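
For example, only the optimizer line in the code above needs to change (a minimal sketch; 1e-4 is just a starting point to tune from, not a guaranteed best value):

import torch.optim as optim

# same model and training loop as before; only the step size changes,
# replacing the lr=0.01 that made Adam diverge to a constant ~2.30 loss
optimizer = optim.Adam(model.parameters(), lr=1e-4)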

And "Adam should be better than SGD" is not always true.

Hi @mMagmer,
I changed the lr as you suggested and I see a massive improvement. Thank you!
This raises a number of questions:

  1. Does the Adam optimizer usually need an lr of this order of magnitude?
  2. Is there a rule of thumb for which lr I should use with each optimizer?
  3. Next time I build a network and fail to train it, how can I know whether it's because of a bad architecture or because (as in this case) I didn't choose the right lr? (There are so many parameters; how could I know where I went wrong?)

Thank you very much for your kind help,

I'll answer your questions, but I might be wrong.

  1. Yes.
  2. I think you can start by playing around with the default lr of each optim method, i.e. for SGD it's 0.1 and for Adam it's 1e-3.
  3. If your model architecture does not require a specific optim method, you can use SGD with momentum. It's very well-behaved (see the sketch below).

It also depends on the batch size and on how you are averaging the loss.
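
For instance, a minimal sketch of both options, reusing the `model` from the code above (the momentum value of 0.9 is a conventional choice, not something official or from this thread):

import torch.optim as optim

# SGD with momentum: a well-behaved baseline when the architecture
# does not demand a particular optimizer
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# or Adam at its default lr
# optimizer = optim.Adam(model.parameters(), lr=1e-3)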

@mMagmer thanks for your answers.

Can you please explain more about that?
What should I assume when I use a small/big batch size, and how does the loss averaging affect the learning rate?

By default, nn.CrossEntropyLoss and the other losses use reduction='mean'; if you set reduction='sum', you are effectively multiplying the LR by the batch size.
Even if you're using 'mean', some optim methods require re-tuning the LR after changing the batch size.
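
A small sketch of the first point: with reduction='sum' the loss, and therefore every gradient, is batch_size times larger than with reduction='mean', so a step with the same lr moves the weights batch_size times further (the tensors below are made up purely for the demonstration):

import torch
import torch.nn as nn

logits = torch.randn(20, 10)           # fake batch of 20 samples, 10 classes
targets = torch.randint(0, 10, (20,))  # fake labels

loss_mean = nn.CrossEntropyLoss(reduction='mean')(logits, targets)
loss_sum = nn.CrossEntropyLoss(reduction='sum')(logits, targets)

# loss_sum == batch_size * loss_mean, so its gradients are also
# batch_size times larger; with reduction='sum' you would want
# roughly lr / batch_size to take comparable optimization steps
print(loss_sum / loss_mean)  # ~20.0, the batch size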

@mMagmer thank you for your answer.
I will read more about it.