Model loss not decreasing even after increasing learning rate

slug · July 21, 2025, 12:02pm

I’m fairly unfamiliar with PyTorch (and ML in general), so please bear with me. I’ve written a CNN that takes a 1x64x200 input and predicts one of 7 labels (labelled 0-6). My model is as follows:

class Net(nn.Module):
    def __init__(self,dropout):
        super(Net,self).__init__()
        self.conv1=nn.Conv2d(1,32,7,stride=1,padding=3)
        self.conv2=nn.Conv2d(32,16,7,stride=1,padding=3)
        self.conv3=nn.Conv2d(16,8,5,stride=1,padding=2)
        self.conv4=nn.Conv2d(8,16,5,stride=1,padding=2)
        self.conv5=nn.Conv2d(16,4,3,stride=1,padding=1)
        self.pool1=nn.MaxPool2d(2)
        self.pool2=nn.MaxPool2d(4)
        self.fc1=nn.Linear(800,7)
        self.dropout1=nn.Dropout2d(dropout)
        self.batchnorm1=nn.BatchNorm2d(32)
        self.batchnorm2=nn.BatchNorm2d(16)
        self.batchnorm3=nn.BatchNorm2d(8)
        self.batchnorm4=nn.BatchNorm2d(4)

    def forward(self,x):
        x=self.conv1(x)
        x=self.batchnorm1(x)
        x=self.dropout1(x)
        x=nn.functional.relu(x)

        x=self.conv2(x)
        x=self.batchnorm2(x)
        x=self.dropout1(x)
        x=nn.functional.relu(x)
        x=self.pool1(x)

        x=self.conv3(x)
        x=self.batchnorm3(x)
        x=self.dropout1(x)
        x=nn.functional.relu(x)

        x=self.conv4(x)
        x=self.batchnorm2(x)
        x=self.dropout1(x)
        x=nn.functional.relu(x)

        x=self.conv5(x)
        x=self.batchnorm4(x)
        x=self.dropout1(x)
        x=nn.functional.relu(x)
        x=self.pool2(x)

        x=torch.flatten(x,1)
        x=self.fc1(x)
        x=nn.functional.relu(x)
        return x
    


def train_test(net,epochs,train_loader,test_loader,device):
    criterion=nn.CrossEntropyLoss()
    optimizer=optim.Adam(net.parameters(),lr=3e-3)
    train_acc=[]
    train_loss=[]
    test_acc=[]
    test_loss=[]
    net.to(device)
    for epoch in tqdm.tqdm(range(epochs)):
        net.train()
        running_loss=0.0
        correct,total=0,0
        for i,data in enumerate(train_loader,start=0):
            inputs,labels=data
            inputs=inputs.to(device).float()
            labels=labels.to(device).long()
            
            #train
            optimizer.zero_grad()
            outputs=net.forward(inputs)
            loss=criterion(outputs,labels)
            loss.backward()
            optimizer.step()

            running_loss+=loss.item()
            #training accuracy
            _,predicted=torch.max(outputs,1)
            total+=labels.size(0)
            correct+=(predicted==labels).sum()
        train_loss.append(running_loss/len(train_loader))
        train_acc.append(correct/total)
        print(f"epoch {epoch} --> TRAIN loss: {running_loss/len(train_loader):.5f}, TRAIN accuracy: {correct/total:.2f}")

        #eval on test
        net.eval()
        running_loss=0.0
        correct,total=0,0
        for inputs,labels in test_loader:
            inputs,labels=inputs.to(device).float(),labels.to(device).long()
            outputs=net.forward(inputs)
            loss=criterion(outputs,labels)
            running_loss+=loss.item()

            #test acc
            _,predicted=torch.max(outputs,1)
            total+=labels.size(0)
            correct+=(predicted==labels).sum()
        test_loss.append(running_loss/len(test_loader))
        test_acc.append(correct/total)
        print(f"epoch {epoch} --> TEST loss: {running_loss/len(train_loader):.2f}, TEST accuracy: {correct/total:.2f}")

    return train_loss,train_acc,test_loss,test_acc


batch_size=150
test_data = TensorDataset(X_test, y_test)
test_loader = DataLoader(test_data, batch_size=batch_size,
                         shuffle=False
                         )

train_data = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_data,
                          batch_size=batch_size,
                          drop_last=False,
                          shuffle=True
                          )

Training with a low learning rate (1e-4) leaves the training loss oscillating at a high value (around 1.95). Increasing the learning rate, even to 0.1, doesn’t decrease the training loss. I can’t tell if my model is stuck at a local minimum or if there is something else fundamentally wrong with my model design or code.

Any help would be greatly appreciated. Thank you!

training at learning rate=0.1:

  0%|          | 0/50 [00:00<?, ?it/s]
epoch 0 --> TRAIN loss: 1.97984, TRAIN accuracy: 0.14
  2%|▏         | 1/50 [00:18<14:43, 18.02s/it]
epoch 0 --> TEST loss: 0.49, TEST accuracy: 0.14
epoch 1 --> TRAIN loss: 1.94591, TRAIN accuracy: 0.14
  4%|▍         | 2/50 [00:36<14:40, 18.33s/it]
epoch 1 --> TEST loss: 0.49, TEST accuracy: 0.14
epoch 2 --> TRAIN loss: 1.94591, TRAIN accuracy: 0.14
  6%|▌         | 3/50 [00:54<14:17, 18.24s/it]
epoch 2 --> TEST loss: 0.49, TEST accuracy: 0.14
epoch 3 --> TRAIN loss: 1.94591, TRAIN accuracy: 0.14
  8%|▊         | 4/50 [01:13<14:01, 18.30s/it]
epoch 3 --> TEST loss: 0.49, TEST accuracy: 0.14
epoch 4 --> TRAIN loss: 1.94591, TRAIN accuracy: 0.14
 10%|█         | 5/50 [01:31<13:40, 18.24s/it]
epoch 4 --> TEST loss: 0.49, TEST accuracy: 0.14
epoch 5 --> TRAIN loss: 1.94591, TRAIN accuracy: 0.14
 12%|█▏        | 6/50 [01:50<13:31, 18.43s/it]
epoch 5 --> TEST loss: 0.49, TEST accuracy: 0.14
epoch 6 --> TRAIN loss: 1.94591, TRAIN accuracy: 0.14
 14%|█▍        | 7/50 [02:08<13:15, 18.51s/it]
epoch 6 --> TEST loss: 0.49, TEST accuracy: 0.14

training at learning rate=1e-3:

  0%|          | 0/50 [00:00<?, ?it/s]
epoch 0 --> TRAIN loss: 1.94748, TRAIN accuracy: 0.15
  2%|▏         | 1/50 [00:18<15:06, 18.50s/it]
epoch 0 --> TEST loss: 0.49, TEST accuracy: 0.14
epoch 1 --> TRAIN loss: 1.94572, TRAIN accuracy: 0.14
  4%|▍         | 2/50 [00:36<14:43, 18.40s/it]
epoch 1 --> TEST loss: 0.49, TEST accuracy: 0.14
epoch 2 --> TRAIN loss: 1.94623, TRAIN accuracy: 0.14
  6%|▌         | 3/50 [00:54<14:17, 18.25s/it]
epoch 2 --> TEST loss: 0.49, TEST accuracy: 0.14
epoch 3 --> TRAIN loss: 1.94588, TRAIN accuracy: 0.14
  8%|▊         | 4/50 [01:13<13:59, 18.25s/it]
epoch 3 --> TEST loss: 0.49, TEST accuracy: 0.14
epoch 4 --> TRAIN loss: 1.94597, TRAIN accuracy: 0.14
 10%|█         | 5/50 [01:31<13:45, 18.35s/it]
epoch 4 --> TEST loss: 0.49, TEST accuracy: 0.14
epoch 5 --> TRAIN loss: 1.94591, TRAIN accuracy: 0.14
 12%|█▏        | 6/50 [01:50<13:29, 18.40s/it]
epoch 5 --> TEST loss: 0.49, TEST accuracy: 0.14
epoch 6 --> TRAIN loss: 1.94586, TRAIN accuracy: 0.14
 14%|█▍        | 7/50 [02:08<13:13, 18.46s/it]
epoch 6 --> TEST loss: 0.49, TEST accuracy: 0.14
epoch 7 --> TRAIN loss: 1.94593, TRAIN accuracy: 0.14
 16%|█▌        | 8/50 [02:27<12:56, 18.49s/it]
epoch 7 --> TEST loss: 0.49, TEST accuracy: 0.14
epoch 8 --> TRAIN loss: 1.94591, TRAIN accuracy: 0.14
 18%|█▊        | 9/50 [02:45<12:39, 18.53s/it]
epoch 8 --> TEST loss: 0.49, TEST accuracy: 0.14
epoch 9 --> TRAIN loss: 1.94592, TRAIN accuracy: 0.14
 20%|██        | 10/50 [03:04<12:22, 18.56s/it]
epoch 9 --> TEST loss: 0.49, TEST accuracy: 0.14
epoch 10 --> TRAIN loss: 1.94590, TRAIN accuracy: 0.14
 22%|██▏       | 11/50 [03:23<12:08, 18.67s/it]
epoch 10 --> TEST loss: 0.49, TEST accuracy: 0.14
epoch 11 --> TRAIN loss: 1.94597, TRAIN accuracy: 0.14
 24%|██▍       | 12/50 [03:41<11:47, 18.61s/it]
epoch 11 --> TEST loss: 0.49, TEST accuracy: 0.14
epoch 12 --> TRAIN loss: 1.94591, TRAIN accuracy: 0.14
 26%|██▌       | 13/50 [04:00<11:28, 18.60s/it]
epoch 12 --> TEST loss: 0.49, TEST accuracy: 0.14
epoch 13 --> TRAIN loss: 1.94592, TRAIN accuracy: 0.14
 28%|██▊       | 14/50 [04:19<11:09, 18.60s/it]
epoch 13 --> TEST loss: 0.49, TEST accuracy: 0.14
epoch 14 --> TRAIN loss: 1.94594, TRAIN accuracy: 0.14
 30%|███       | 15/50 [04:37<10:50, 18.59s/it]
epoch 14 --> TEST loss: 0.49, TEST accuracy: 0.14
epoch 15 --> TRAIN loss: 1.94588, TRAIN accuracy: 0.14
 32%|███▏      | 16/50 [04:56<10:29, 18.52s/it]
epoch 15 --> TEST loss: 0.49, TEST accuracy: 0.14
epoch 16 --> TRAIN loss: 1.94593, TRAIN accuracy: 0.14
 34%|███▍      | 17/50 [05:14<10:09, 18.48s/it]
epoch 16 --> TEST loss: 0.49, TEST accuracy: 0.14

KFrank · July 21, 2025, 3:35pm

Hi Slug!

slug:

        x=self.fc1(x)
        x=nn.functional.relu(x)
        return x
I can’t tell if my model is stuck at a local minimum or if there is something else fundamentally wrong with my model design or code.

First, try getting rid of that final relu(). You should be feeding the output of your final
Linear layer directly into your CrossEntropyLoss loss criterion.
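
As a hedged aside on why the loss sits at exactly 1.94591: that number is ln(7), the cross-entropy of a uniform guess over seven classes. If the final relu() ends up clamping every logit to zero (an assumption about what is happening here, though it fits the logs), softmax assigns 1/7 to each class whatever the input is. A minimal pure-Python sketch of that arithmetic, with `cross_entropy` and `dead_logits` as illustrative names:

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log(softmax(logits)[target])."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[target]

# Seven logits that a ReLU has clamped to zero (hypothetical worst case).
dead_logits = [0.0] * 7
plateau = cross_entropy(dead_logits, target=3)
print(f"{plateau:.5f}")  # 1.94591, and the same for any target class
```

With all logits equal, the gradient carries no information that separates the classes, which would be consistent with the learning rate making no difference.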

epoch 6 --> TRAIN loss: 1.94591, TRAIN accuracy: 0.14
 14%|█▍        | 7/50 [02:08<13:15, 18.51s/it]
epoch 6 --> TEST loss: 0.49, TEST accuracy: 0.14

…

epoch 16 --> TRAIN loss: 1.94593, TRAIN accuracy: 0.14
 34%|███▍      | 17/50 [05:14<10:09, 18.48s/it]
epoch 16 --> TEST loss: 0.49, TEST accuracy: 0.14

Your output suggests that you are training for maybe seven or seventeen or fifty epochs.
Try training for a lot longer.
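
A side observation on the TEST loss column, offered tentatively: the posted train_test prints the test loss as running_loss/len(train_loader) instead of dividing by len(test_loader), so the 0.49 is probably the same chance-level loss on a different scale. A small check, assuming each test batch also has a loss of about ln(7):

```python
import math

chance_loss = math.log(7)   # about 1.94591, the TRAIN loss plateau
printed_test_loss = 0.49    # value shown in the logs

# If every test batch also has loss ~ln(7), the printed value equals
# ln(7) * len(test_loader) / len(train_loader), so this ratio estimates
# len(test_loader) / len(train_loader).
batch_ratio = printed_test_loss / chance_loss
print(f"{batch_ratio:.2f}")  # 0.25
```

A ratio near 0.25 would fit a test loader with roughly a quarter as many batches as the training loader. The value stored by test_loss.append(...) already divides by len(test_loader), so only the printed number is affected.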

Adam is a good optimizer, but I recommend starting with plain-vanilla SGD. It can be
more stable (even if sometimes slower) and is easier to reason about.

See if you can overfit on a single batch or a small number of batches.
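
A minimal sketch of the single-batch overfitting check with plain SGD. Everything here is a stand-in: the data is random, the model is a deliberately tiny placeholder rather than the Net from the question, and the learning rate and step count are illustrative guesses. The point is only the shape of the test: fix one batch, train on it repeatedly, and watch whether the loss falls well below ln(7).

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Stand-ins for one real batch: 16 random "recordings" with random labels.
inputs = torch.randn(16, 1, 64, 200)
labels = torch.randint(0, 7, (16,))

# Tiny placeholder model; in practice use the real network with dropout 0.
model = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=8, stride=8),  # -> [N, 4, 8, 25]
    nn.ReLU(),
    nn.Flatten(),                              # -> [N, 800]
    nn.Linear(4 * 8 * 25, 7),                  # raw logits, no final ReLU
)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

losses = []
for step in range(300):
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)  # same batch every step
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

If the loss on a single memorizable batch refuses to drop, the problem is more likely in the model or the training loop than in the amount of data.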

Some questions:

How large is your training set? How balanced is your training set? Does each of your seven
classes have about the same number of samples?
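
One way to answer the balance question is to count labels directly. A small sketch, assuming integer class labels; `class_counts` and `example_labels` are hypothetical names, and the range check reflects the fact that CrossEntropyLoss with 7 outputs expects targets in 0..6:

```python
from collections import Counter

def class_counts(labels, num_classes=7):
    """Return how many samples carry each label 0..num_classes-1."""
    counts = Counter(int(label) for label in labels)
    unexpected = sorted(set(counts) - set(range(num_classes)))
    if unexpected:
        raise ValueError(f"labels outside 0..{num_classes - 1}: {unexpected}")
    return [counts.get(c, 0) for c in range(num_classes)]

# Hypothetical labels; with a tensor, pass y_train.tolist() instead.
example_labels = [0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 6]
counts = class_counts(example_labels)
print(counts)  # [2, 2, 2, 2, 2, 2, 3]
```

Roughly equal counts mean chance accuracy is about 1/7, or 0.14, which matches the accuracy in the logs.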

Could a person looking at your data do a good job of classifying the input samples?

Good luck!

K. Frank

slug · July 22, 2025, 5:28am

Hi Frank, thank you so much for your guidance! The model loss started decreasing (very slowly) after I removed that last relu(). I’ll switch to SGD in a bit to see if that works better.

For context, I’m working on EEG data, trying to classify emotion based on data sampled at 200 Hz over a 30-second interval (so an interval of 6000 timepoints) during which a subject watches a video meant to elicit some emotion. Each 30-second sample is labelled with one of 7 emotions.

21 recordings (3 per emotion) were made per subject (so [21 (trials), 64 (EEG channels), 6000]). I’ve split each 30-second sample into 1-second samples (which increases my sample size by 30x), leaving me with [630,64,200] trials per subject (with an equal number of trials per label). Using 40 subjects gives me 25200 samples (before splitting into test/training/validation sets), with shape [25200,1,64,200]. (This is what I’m currently training with.)

I was told that I might be better off splitting every sample into 5 overlapping samples (so 0-10s, 5-15s, 10-20s, 15-25s, 20-30s per 30s sample), which would only multiply my sample size by 5 but should be more likely to capture the labelled emotion. Doing this only gives me 4200 samples for 40 subjects (which I worry might not work well given the model complexity).
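
The two windowing schemes above can be sketched as plain index arithmetic. This is only an illustration of the sample counts, using the sampling rate, trial length, and subject count quoted in this post; `window_starts` is a hypothetical helper:

```python
def window_starts(total_len, window_len, step):
    """Start indices of all full windows of window_len, advancing by step."""
    return list(range(0, total_len - window_len + 1, step))

FS = 200                 # sampling rate in Hz
TRIAL_LEN = 30 * FS      # 6000 timepoints per 30 s trial
TRIALS = 21 * 40         # 21 trials per subject, 40 subjects

# Option A: non-overlapping 1 s windows.
one_sec = window_starts(TRIAL_LEN, 1 * FS, 1 * FS)
# Option B: 10 s windows with 50% overlap (0-10 s, 5-15 s, ...).
ten_sec = window_starts(TRIAL_LEN, 10 * FS, 5 * FS)

print(len(one_sec), len(one_sec) * TRIALS)   # 30 25200
print(len(ten_sec), len(ten_sec) * TRIALS)   # 5 4200
```

Either way the windows are cut from the same 840 underlying recordings, so the counts measure how the data is sliced rather than how much independent data there is.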

Given the nature of the dataset, it’s not really possible to discern the labelled emotion just from looking at the EEG (maybe an expert could? I definitely can’t).

Cheers!

KFrank · July 22, 2025, 7:13pm

Hi Slug!

As I understand it you are passing a 64x200 “image” into your initial Conv2d. This seems
wrong to me as your 64 eeg channels don’t have a natural linear ordering. It would make
more sense to me to use a Conv1d with in_channels = 64. (Conversely, it does make
sense, as you are doing, to convolve over the time dimension.)

You could argue that the locations of the eeg electrodes on the scalp do have some real
two-dimensional structure. You could kind-of, sort-of map your 64 electrode locations to
an 8x8 2d grid. You could then pass an 8x8x200 3d “image” to a Conv3d. Certainly two
electrodes over the left ear will be more correlated with one another than either will be with
an electrode over the right eyebrow. In any event, I would start with a 64-channel Conv1d
as the straightforward baseline and only move on to some Conv3d scheme if you can
show that it actually works better.
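
A rough sketch of what such a 64-channel Conv1d baseline might look like. The class name `EEGConv1d`, the layer widths, the kernel sizes, and the pooling choices are all assumptions made for illustration and are untuned; the only points taken from the discussion are that the 64 electrodes act as input channels and that convolution runs over time only.

```python
import torch
from torch import nn

class EEGConv1d(nn.Module):
    """Hypothetical baseline: electrodes are channels, convolve over time."""

    def __init__(self, n_electrodes=64, n_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_electrodes, 32, kernel_size=7, padding=3),
            nn.BatchNorm1d(32),
            nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(32, 16, kernel_size=5, padding=2),
            nn.BatchNorm1d(16),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # average over the remaining timepoints
        )
        self.classifier = nn.Linear(16, n_classes)  # raw logits, no final ReLU

    def forward(self, x):
        # x: [batch, 64, time]
        return self.classifier(self.features(x).flatten(1))

model = EEGConv1d()
batch = torch.randn(8, 1, 64, 200)   # the [N, 1, 64, 200] layout from the thread
logits = model(batch.squeeze(1))     # drop the singleton dim -> [8, 64, 200]
print(logits.shape)                  # torch.Size([8, 7])
```

Because of the adaptive pooling, the same sketch accepts longer windows (for example 2000 or 6000 timepoints) without changing the final Linear layer.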

Splitting your 30-second recordings into shorter samples strikes me as offering illusory benefit at best and likely being counter-productive.

Consider the following analogy: I’m building a cat-vs-dog classifier that takes 64x64
images. But I train an 8x8 classifier on sub-images obtained by breaking up my original
images into 64 8x8 sub-images each. Even though I’ve (artificially) increased the number
of training “samples” by a factor of 64, I haven’t in any sense increased the amount of
information in the training set. In fact, I’ve degraded it because the spatial relationship
of the sub-images that come from any single original image has been lost.

As a general rule of thumb, I think if it’s hard (or next to impossible) for a person to do it,
it will be (quite) hard to train a network to do it. You will likely require a goodly amount of
training data, long training times, and an appropriate network architecture.

Leaving aside the issue of splitting your samples, I would say that you have 840 samples
(with 120 per class). This seems to me to be a reasonable amount of data, but not really
“a lot.” As you train longer and longer on a somewhat limited data set, it becomes a race
between your network learning your potentially-difficult classification problem and your
network “memorizing” specific training samples (i.e., overfitting).

I would start by seeing whether you can get good loss and accuracy results on your
training set, even if you have to train for a long time to do it. If your test / validation sets
indicate that overfitting is occurring, I would cross that bridge when I got to it.

Best.

K. Frank