Image recognition (Alexnet) training loss is not decreasing

Hello everyone,

I am new to the website and to this field as well.

I was trying to train AlexNet on the Caltech dataset. I am not getting any errors, but the training loss does not decrease during training.

I wonder if anyone knows what the problem is.

The code follows:


if not os.path.isdir('./Homework2-Caltech101'):
  !git clone

DATA_DIR = 'Homework2-Caltech101/101_ObjectCategories'
SPLIT_TRAIN = 'Homework2-Caltech101/train.txt'
SPLIT_TEST = 'Homework2-Caltech101/test.txt'

train_dataset = Caltech(DATA_DIR, split = SPLIT_TRAIN, transform=train_transform) 
net = alexnet() # Loading AlexNet model
net.classifier[6] = nn.Linear(4096, NUM_CLASSES)

criterion = nn.CrossEntropyLoss() # for classification, we use Cross Entropy

total_loss = 0
current_step = 0

num_train = len(train_dataset)
indices = list(range(num_train))
split = int(np.floor(valid_size * num_train)) #number of validation samples


train_idx, valid_idx = indices[split:], indices[:split]
train_sampler = SubsetRandomSampler(train_idx)
valid_sampler = SubsetRandomSampler(valid_idx)

net = 
batch_size_list = [10, 100, 1000]
LR_list = [ .01, .001, .0001]
for BATCH_SIZE in batch_size_list:
    for LR in LR_list:
        train_loader =, batch_size=BATCH_SIZE, sampler=train_sampler,num_workers=4,  drop_last=True)
        optimizer = optim.SGD(net.parameters(), lr=LR, momentum=MOMENTUM, weight_decay=WEIGHT_DECAY)
        scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=STEP_SIZE, gamma=GAMMA)
        for epoch in range(2):
            total_loss = 0
            print('\n Starting epoch {}/{}, LR = {}, batch size= {}'.format(epoch+1, NUM_EPOCHS, scheduler.get_lr(), BATCH_SIZE))
            for images, labels in train_loader:
                images =
                labels =
                outputs = net(images)
                loss = criterion(outputs, labels)
                if current_step % LOG_FREQUENCY == 0:
                    print('Step {}, Loss {}'.format(current_step, loss.item()))
                loss.backward()  # backward pass: computes gradients
                optimizer.step() # update weights based on accumulated gradients
                current_step += 1
                total_loss += loss.item() * BATCH_SIZE
                print('Loss', total_loss)
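
A note on the loop as posted: it never calls `optimizer.zero_grad()`. PyTorch adds new gradients to the existing `.grad` buffers on every `backward()` call, so without zeroing, each `optimizer.step()` would use the running sum of all gradients computed so far. Whether that call was simply lost when the code was pasted is unclear. The sketch below illustrates the accumulation on a hypothetical single-layer model (a stand-in, not AlexNet) with made-up random data:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in model and data, chosen only to keep the demo tiny.
model = nn.Linear(4, 2)
criterion = nn.CrossEntropyLoss()
inputs = torch.randn(8, 4)
targets = torch.randint(0, 2, (8,))

# First backward pass: .grad holds the gradient for this batch.
criterion(model(inputs), targets).backward()
grad_once = model.weight.grad.clone()

# Second backward pass without zeroing: the new gradient is added to the old one.
criterion(model(inputs), targets).backward()
grad_accumulated = model.weight.grad.clone()

# Zeroing first gives a fresh gradient again.
model.zero_grad()
criterion(model(inputs), targets).backward()
grad_after_zero = model.weight.grad.clone()

print(torch.allclose(grad_accumulated, 2 * grad_once))   # gradient doubled
print(torch.allclose(grad_after_zero, grad_once))        # back to a single gradient
```

In a training loop, the usual pattern is to call `optimizer.zero_grad()` once per iteration, before `loss.backward()`.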

Have you tried playing around with the hyperparameters, e.g. lowering the learning rate?
Is the loss constant, i.e. a single value, or is it noisy around the initial value?

Actually, I am trying to tune the hyperparameters by looping over the code with different learning rates, batch sizes, and numbers of epochs.

For all of the above values, the loss appears to oscillate around the initial value.

I also tried changing the weight decay and momentum.

Thanks for the update!
In that case I would recommend trying to overfit a small data sample as a simple test.
If your model is not able to overfit this small data sample (e.g. 10 samples), you might have a bug in the code somewhere, which I might have missed.

Once you can overfit, I would try to scale up the experiment slowly and make sure your model is still learning.
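
The overfitting check described above can be sketched roughly as follows. Everything concrete here is an assumption made for illustration: the data are random tensors, the network is a small stand-in rather than AlexNet, and Adam with `lr=1e-3` is just a convenient choice. The point is only the shape of the test: a fixed batch of 10 samples, many steps on that same batch, and a loss that should drop towards zero.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

NUM_CLASSES = 3    # hypothetical; the real task has many more classes
NUM_SAMPLES = 10   # the "small data sample"

# Hypothetical stand-in data: random "images" with random labels.
images = torch.randn(NUM_SAMPLES, 3, 32, 32)
labels = torch.randint(0, NUM_CLASSES, (NUM_SAMPLES,))

# Hypothetical stand-in model; swap in the real network to test it.
net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 64),
    nn.ReLU(),
    nn.Linear(64, NUM_CLASSES),
)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=1e-3)

net.train()
losses = []
for step in range(200):
    optimizer.zero_grad()                   # clear gradients from the previous step
    loss = criterion(net(images), labels)   # always the same 10 samples
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Accuracy on the very samples the model was trained on.
with torch.no_grad():
    train_acc = (net(images).argmax(dim=1) == labels).float().mean().item()

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}, train acc {train_acc:.2f}")
```

If the real model and training loop cannot drive the loss down on a sample this small, the problem is more likely in the code or the setup than in the choice of hyperparameters.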

Hi ptrblck,

I am facing the same issue with my code. I am trying to train the AlexNet model provided by the torch library.
But my loss is not decreasing. It fluctuates between 2.311 and 2.312.
I tried changing the learning rate (0.1, 0.05, 0.01, etc.), the batch size, and the number of epochs, but nothing works.

Can you suggest some option? Can I send you my code? @ptrblck

Did you try to overfit a small data sample as suggested in the last post?
If not, I would highly recommend it, as it’s an easy and fast way to make sure the general training code and model could work.
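
As a side note on the numbers reported above: a loss stuck between 2.311 and 2.312 is close to ln(10) ≈ 2.303, which is the cross-entropy of a uniform guess over 10 classes. The post does not say how many classes the task has, so the 10 is an assumption. If it holds, the value would suggest the network is outputting near-uniform predictions and has not learned anything yet. The helper below (a hypothetical name, not a library function) computes this chance-level reference value:

```python
import math


def chance_level_loss(num_classes: int) -> float:
    """Cross-entropy (natural log) of a uniform prediction over num_classes."""
    return math.log(num_classes)


print(round(chance_level_loss(10), 4))    # 2.3026
print(round(chance_level_loss(101), 4))   # 4.6151
```

Comparing the observed loss with this value is a quick way to tell "stuck at chance" apart from "learning slowly".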

Hi @ptrblck, I’m having an issue along the same lines.

I’m training an AlexNet model from scratch, having more or less followed this breakdown: "Finetuning Torchvision Models" (PyTorch Tutorials 1.2.0 documentation). My code can be seen here: AlexNet-Filter-Fun/ at 3cd29eb55b3b11e2668211a5e4d129dc0dfbe359 (NeilFranks/AlexNet-Filter-Fun on GitHub).

As you recommended, I’m trying on a small data sample (two classes), but I’m getting absolutely no movement (in validation, it guesses the same class every time, resulting in an accuracy of 0.5).

Even in the tutorial I linked, at the bottom you can see they examined training models from scratch, and even THEIR accuracy stayed completely stagnant from the beginning.

What am I missing? I suspect it’s something to do with my optimizer?

Edit: Now I’m sure it’s to do with the optimizer; upon finding someone else trying to implement it the same way as the original Alexnet paper, they too say it “doesn’t train”, and when I use the optimizer they used instead (Adam), it trains okay!

Could you help a beginner like me understand why the Alexnet paper’s optimizer just doesn’t train?

It’s hard to tell what might be causing the failure in training you are seeing, but I would think it depends on the hyperparameters you are using as well as the overall training routine.

I would try to check the reference implementation (if there is one), as I’m sure there is code that reproduces the original claims of the paper (or comes close to them). AlexNet is probably one of the more important models, so it would be surprising if the paper weren’t reproducible (unless the authors have a corrected version and the original paper doesn’t mention it).

Can you elaborate on what might be the culprit in the training routine or other hyperparameters, then, if simply changing the optimizer from torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0005) to torch.optim.Adam(params=model.parameters(), lr=0.0001) seems to alleviate my problems (where there was zero improvement in accuracy, there is now much improvement!) in the small, two-class sample run?

Additionally, I found the more or less “canon” implementation. I see they mention using SGD in the wiki

--mini 128	train using SGD with minibatch of 128 examples

but aside from that I cannot untangle where/how it’s being done. I see the learning rate stuff in the code, but nothing related to SGD with “momentum of 0.9” as specified in the paper and mentioned in passing in the wiki.

Here’s where I think they do all the training: cuda-convnet2/ at 3238bf0367f63eb370e897b9e5714794cb67ddc2 (akrizhevsky/cuda-convnet2 on GitHub). Perhaps the SGD stuff is hidden behind the C++ module they load? It’s hard for me to decipher what this code is doing, tbh:

    def import_model(self):
        lib_name = "cudaconvnet._ConvNet"
        print "========================="
        print "Importing %s C++ module" % lib_name
        self.libmodel = __import__(lib_name,fromlist=['_ConvNet'])

Anyway, from what else I’ve seen of how people have implemented AlexNet in PyTorch, they consistently seem to use Adam. For instance, this piece of code, which I hadn’t looked at until just now, after I ran into this problem in my own implementation: alexnet-pytorch/ at d0c1b1c52296ffcbecfbf5b17e1d1685b4ca6744 (dansuh17/alexnet-pytorch on GitHub).

I think I/we are missing something nuanced about the optimizer, I’m just not sure what… any help would be appreciated (for now, though, I’ll use Adam :slight_smile: )

I’m not an expert when it comes to the nuances of different optimizers, but in the past I’ve seen that while SGD can yield a lower final loss, getting it to train/converge might be harder than with a more sophisticated optimizer. As a side note: when trying to debug some code, my default optimizer is always Adam, as it usually makes it easy to tell whether the code has a bug somewhere (e.g. accidentally detached tensors) or whether the optimization itself is failing due to a bad set of hyperparameters.
With that being said, you might want to adapt the learning rate in SGD and see if this helps.

Interesting about Adam being the go-to, I will remember that. No problem, and thanks for the help. One last thing: could you direct me to anything that can educate me on the “nuance” of these optimizers (beyond the basics; I’ve looked at the docs and the Adam paper but didn’t pick up much practical knowledge for tackling whatever issue I’m seeing)?

Also, I tried adapting the learning rate in SGD, as you suggested, as well as other tweaks to the SGD optimizer, to no avail :frowning: I’m currently baffled

Yeah, but be careful with that and don’t limit yourself: I’m usually not training an entire model end to end, just debugging convergence issues. Adam might not be the go-to anymore.

You should check the current literature on optimizers in ML and also look at repositories that are known to work to see what they are using.

@rwightman would know a lot more about successfully training state of the art CV models.

The specific optimizer shouldn’t impact the ability to converge vs not in most cases. It should be possible to use any of the common optimizers for this task, it’s usually a matter of getting all the details right and searching over your hparams (if you aren’t starting from known good defaults).

That said, AlexNet is difficult to train, and most adaptive optimizers tend to be more forgiving in challenging situations or with non-optimal hparams. So, first q would be: why AlexNet? Using a net that has normalization layers (i.e. BatchNorm) and residual connections will make it significantly easier to train. And if it must be AlexNet, you probably need to dig through the original impl and make sure your weight init is close to the original (which is likely not the case anymore with the PyTorch default impl). You also need to use the correct batch size vs. LR. I think the original was a batch size of 128 for an LR of .01, but it’s not clear whether that 128 factored in the 2x GPU, so it might be 256 and .01 equivalent. Being off even a small amount with SGD can mean instability with a net like this.
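
To make the weight-init point concrete, here is a rough sketch of an initialisation following the description in the original AlexNet paper: weights drawn from a zero-mean Gaussian with standard deviation 0.01, biases set to 1 in the second, fourth, and fifth convolutional layers and in the hidden fully connected layers, and to 0 elsewhere. The function name `paper_style_init` is hypothetical, and the code assumes a model with five `Conv2d` layers followed by three `Linear` layers registered in forward order. The toy network is a stand-in used only to exercise the function. The details are worth double-checking against the reference implementation before relying on them.

```python
import torch
import torch.nn as nn


def paper_style_init(model: nn.Module) -> None:
    """Initialise an AlexNet-shaped model roughly as the AlexNet paper describes.

    Assumes five Conv2d layers and three Linear layers, registered in forward order.
    """
    convs = [m for m in model.modules() if isinstance(m, nn.Conv2d)]
    linears = [m for m in model.modules() if isinstance(m, nn.Linear)]

    # All weights: zero-mean Gaussian with std 0.01; all biases start at 0.
    for layer in convs + linears:
        nn.init.normal_(layer.weight, mean=0.0, std=0.01)
        nn.init.zeros_(layer.bias)

    # Biases of conv layers 2, 4 and 5 (0-based indices 1, 3, 4) are set to 1.
    for i in (1, 3, 4):
        nn.init.ones_(convs[i].bias)

    # Biases of the hidden fully connected layers (all but the last) are set to 1.
    for layer in linears[:-1]:
        nn.init.ones_(layer.bias)


torch.manual_seed(0)

# Hypothetical toy network with the same layer counts as AlexNet (5 conv, 3 linear).
toy = nn.Sequential(
    nn.Conv2d(3, 4, 3), nn.ReLU(),
    nn.Conv2d(4, 4, 3), nn.ReLU(),
    nn.Conv2d(4, 4, 3), nn.ReLU(),
    nn.Conv2d(4, 4, 3), nn.ReLU(),
    nn.Conv2d(4, 4, 3), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16, 8), nn.ReLU(),
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, 2),
)
paper_style_init(toy)

toy_convs = [m for m in toy.modules() if isinstance(m, nn.Conv2d)]
toy_linears = [m for m in toy.modules() if isinstance(m, nn.Linear)]
print([float(c.bias[0]) for c in toy_convs])     # conv biases, layer by layer
print([float(l.bias[0]) for l in toy_linears])   # linear biases, layer by layer
```

With torchvision's `alexnet()`, the same function could be applied to the model before training, provided its layers appear in the order assumed above.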


Thanks for all the tips about BatchNorm, weight inits, and the thought about the 2x GPU. I didn’t realize SGD was so sensitive without these considerations.

Last question: is there a succinct explanation for why Adam optimizes Alexnet handily given many different hyper-params (which I’ve tried) while SGD is stubborn? I’ve looked over the Adam paper and understand it estimates “exponential moving averages of the gradient and the squared gradient”, but I don’t understand how this takes care of an issue which, like you say, BatchNorm and other nuances about the batchsize could solve.
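
One commonly cited piece of intuition (not an answer given in this thread) is that Adam divides the averaged gradient by the square root of the averaged squared gradient, so the size of each parameter's update is roughly the learning rate regardless of how large or small the raw gradient is. Plain SGD's update is proportional to the gradient itself, so badly scaled gradients, for example from an unsuitable weight init, translate directly into steps that are far too small or far too large. The sketch below compares the two for a single parameter on the first step. The function names are hypothetical, and momentum and weight decay are left out to keep it minimal.

```python
import math


def sgd_step(grad: float, lr: float) -> float:
    """Update size of plain SGD (no momentum) for one parameter."""
    return lr * grad


def adam_first_step(grad: float, lr: float, beta1: float = 0.9,
                    beta2: float = 0.999, eps: float = 1e-8) -> float:
    """Update size of Adam for one parameter on its first step (t = 1)."""
    m = (1 - beta1) * grad           # moving average of the gradient
    v = (1 - beta2) * grad * grad    # moving average of the squared gradient
    m_hat = m / (1 - beta1)          # bias correction for t = 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)


# Gradients spanning eight orders of magnitude.
for grad in (1e-4, 1.0, 1e4):
    print(f"grad={grad:g}  sgd={sgd_step(grad, lr=0.01):g}  "
          f"adam={adam_first_step(grad, lr=0.01):g}")
```

The SGD update scales with the gradient, while the Adam update stays close to `lr` in every case. That would be consistent with Adam tolerating a wide range of settings here, though it does not by itself show what is wrong with the SGD runs.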