GoogleNet-LSTM, cross entropy loss does not decrease

class googleNet(nn.Module):
  def __init__(self, latent_dim=512):
    super(googleNet, self).__init__()
    self.model = torch.hub.load('pytorch/vision:v0.10.0', 'googlenet', pretrained=True)

    #freeze parameters (trains faster and keeps the ImageNet weight values)
    for params in self.model.parameters():
      params.requires_grad = False

    #change the last fully connected layer
    self.model.fc = nn.Linear(self.model.fc.in_features, latent_dim)

  def forward(self, x):
    output = self.model(x)
    return output

class Lstm(nn.Module):
  def __init__(self, latent_dim = 512, hidden_size = 256, lstm_layers = 2, bidirectional = True):
    super(Lstm, self).__init__()
    self.latent_dim = latent_dim
    self.hidden_size = hidden_size
    self.lstm_layers = lstm_layers
    self.bidirectional = bidirectional
    self.Lstm = nn.LSTM(self.latent_dim, hidden_size=self.hidden_size, num_layers=self.lstm_layers, batch_first=True, bidirectional=self.bidirectional)
    self.hidden_state = None

  def reset_hidden_state(self):
    self.hidden_state = None

  def forward(self,x):
    output, self.hidden_state = self.Lstm(x, self.hidden_state)
    return output

class ConvLstm(nn.Module):
    def __init__(self, google, lstm, n_class = 10):
        super(ConvLstm, self).__init__()
        self.modela = google
        self.modelb = lstm
        self.output_layer = nn.Sequential(
            nn.Linear(2 * self.modelb.hidden_size if self.modelb.bidirectional==True else self.modelb.hidden_size, n_class),
            nn.Softmax(dim=-1)
        )

    def forward(self, x):
        batch_size, timesteps, channel_x, h_x, w_x = x.shape
        conv_input = x.view(batch_size * timesteps, channel_x, h_x, w_x)
        conv_output = self.modela(conv_input)
        lstm_input = conv_output.view(batch_size, timesteps, -1)
        lstm_output = self.modelb(lstm_input)
        lstm_output = lstm_output[:, -1, :]
        output = self.output_layer(lstm_output)
        return output

Above is the NN that I use and the following code is used to train it.

modela = googleNet()
modelb = Lstm()
modelc = ConvLstm(modela,modelb).to(device)
## Loss and optimizer
learning_rate = 5e-4 #I picked this because it seems to be the most used by experts
load_model = True
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(modelc.parameters(), lr= learning_rate) #Adam seems to be the most popular for deep learning


modelc.train()
for epoch in range(100): #I decided to train the model for 100 epochs
    loss_ep = 0
    
    for batch_idx, (data, targets) in enumerate(zip(features_train, labels_train)):
        data = data.to(device)
        targets = targets.to(device)
        ## Forward Pass
        optimizer.zero_grad()
        modelc.modelb.reset_hidden_state()
        scores = modelc(data)
        loss = criterion(scores,targets)
        loss.backward()
        optimizer.step()
        loss_ep += loss.item()
    print(f"Loss in epoch {epoch} :::: {loss_ep/len(features_train)}")

    with torch.no_grad():
        num_correct = 0
        num_samples = 0

The cross-entropy loss stays at 2.301 through all 100 epochs. What is going wrong?
I have read that nn.CrossEntropyLoss already includes the softmax, so I removed it from the output layer, but the loss still stays at the same value.

Your code seems to be able to at least lower the loss for a simple use case of training random inputs to random targets:

device = 'cuda'
modela = googleNet()
modelb = Lstm()
modelc = ConvLstm(modela,modelb).to(device)
## Loss and optimizer
learning_rate = 5e-4
load_model = True
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(modelc.parameters(), lr= learning_rate)

data = torch.randn(16, 10, 3, 224, 224).to(device)
targets = torch.randint(0, 10, (16,)).to(device)

modelc.train()
for epoch in range(100):   
    optimizer.zero_grad()
    modelc.modelb.reset_hidden_state()
    scores = modelc(data)
    loss = criterion(scores,targets)
    loss.backward()
    optimizer.step()
    print("epoch {}, loss {}".format(epoch, loss.item()))

You could play around with this small use case (or use a small subset of your original data) and make sure your model is able to overfit it by playing around with hyperparameters.

Thank you for your response.
I did change the learning rate quite a lot and the loss decreases very slowly. Basically, it varies by a few thousandths, but overall it stays constant to two decimal places.

You said you removed it, but in the code you provided I still see a Softmax call inside your ConvLstm:

class ConvLstm(nn.Module):
    def __init__(self, google, lstm, n_class = 10):
        super(ConvLstm, self).__init__()
        self.modela = google
        self.modelb = lstm
        self.output_layer = nn.Sequential(
            nn.Linear(2 * self.modelb.hidden_size if self.modelb.bidirectional==True else self.modelb.hidden_size, n_class),
            nn.Softmax(dim=-1)  # <-- still here
        )
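
If that line is dropped, the Linear layer's raw logits go straight into nn.CrossEntropyLoss, which applies log-softmax internally. A sketch of what that would look like:

        self.output_layer = nn.Sequential(
            nn.Linear(2 * self.modelb.hidden_size if self.modelb.bidirectional else self.modelb.hidden_size, n_class)
            # nn.Softmax removed: nn.CrossEntropyLoss expects raw, unnormalized logits
        )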

That’s a good point, as I had also removed the softmax from the model definition for my quick experiment.

I had commented it out. Also, I used a really small subset, changed the learning rate, and varied the latent_dim and hidden_size values, and still the lowest error I could reach was 2.27…

EDIT: Deleting my post because in hindsight my observation is wrong.

Hello,
I have been trying a few changes, but the results don't seem to change. I decreased the number of classes used and the overall loss dropped to 1.61, but it then stays at 1.61 with only a very small variation. I have also changed the optimiser to SGD and tried a few more things, and nothing changed. Is there a chance that the models used cannot be trained together?

Random things to check:

  • Are the modelb parameters part of the modelc parameters? You can see that by iterating through modelc.parameters() (important, since that’s what’s passed to the optimizer). If the answer is “no”, that suggests an issue. If the answer is “yes”, can you just check that they are set to requires_grad = True after you set the model to .train()? (See the sketch after this list.)
  • Same question but about modela.fc, which I think you want to train.
  • You are freezing the googlenet parameters inside its __init__. While this might be fine, I’m not sure if there are any hooks that can mess with this; I would guess it’s safer to freeze / unfreeze whatever you want after the model has been instantiated, not in the class declaration.
  • Class Lstm has the same object name as the attribute self.Lstm inside that class. While that may be totally fine due to the clear self namespace, I might rename it to something else just to avoid the possibility of any confusion / weird bug.
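
A minimal sketch of that parameter check, assuming modelc has already been built as above:

# Sketch: list every parameter tensor the optimizer will see and whether it
# will receive gradients; the LSTM and the two replaced Linear layers should show True.
for name, param in modelc.named_parameters():
    print(name, param.requires_grad)

print(sum(1 for p in modelc.parameters() if p.requires_grad), "trainable parameter tensors")
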
  1. I have used print(modelc) to get the parameters and I can see that all the layers are visible, i.e.:
ConvLstm(
  (modela): googleNet(
    (model): GoogLeNet(
      (conv1): BasicConv2d(
        (conv): Conv....
  (modelb): Lstm(
    (Lstm): LSTM(512, 256, num_layers=2, batch_first=True, bidirectional=True)
  )
  (output_layer): Sequential(
    (0): Linear(in_features=512, out_features=5, bias=True)
  )
)

The same can be seen for the LSTM when you print just that model.

  2. I can access modela.model.fc as well as modelc.modela.model.fc (I am not sure if this is what you meant).

  3. Do you mean after instantiating modela or modelc?

  4. I have changed it now.

After these changes the error starts at 0.40… after the first epoch and then fluctuates between 0.397 and 0.395. Does this sound correct?

For 3. yes I meant to do all layer freezing / unfreezing after all model instantiation. Please check that all the parameters that you want to train have requires_grad set to True, and all the ones that you want to be frozen have it set to False. This is most likely the case but it would be good to check.
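
A sketch of that, with all freezing done after the models are built (this assumes the freezing loop is removed from googleNet.__init__):

modela = googleNet()   # assuming __init__ no longer freezes anything
modelb = Lstm()
modelc = ConvLstm(modela, modelb).to(device)

# Freeze the pretrained GoogLeNet backbone, but keep the replaced fc trainable.
for param in modelc.modela.model.parameters():
    param.requires_grad = False
for param in modelc.modela.model.fc.parameters():
    param.requires_grad = True

# Optionally hand the optimizer only the trainable parameters.
optimizer = optim.Adam((p for p in modelc.parameters() if p.requires_grad), lr=learning_rate)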

Is the loss of 0.40 lower than what you had before? You were mentioning 1.61 before but I’m not sure if that’s apples-to-apples. Do you have a sense of what the loss ought to be if this model trained as expected? That’s hard to know from the outside without knowing the details of your problem etc.

modela has two variables that require grad; I guess these are the weight and bias of the last layer.
modelb has 16 variables (all of them, since nothing in it is frozen).
modelc has 20 variables (2 from modela, 16 from modelb, and 2 from the weight and bias of the last layer of modelc).
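
For what it's worth, those counts line up with a quick sketch like this (assuming the models above):

def count_trainable(model):
    # number of parameter tensors with requires_grad=True
    return sum(1 for p in model.parameters() if p.requires_grad)

print(count_trainable(modela))  # 2: weight and bias of the replaced fc
print(count_trainable(modelb))  # 16: 2 layers x 2 directions x 4 tensors (weight_ih, weight_hh, bias_ih, bias_hh)
print(count_trainable(modelc))  # 20: 2 + 16 + weight and bias of the final Linear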

It seems that I made a mistake with my data before, and now the loss is back to 1.60-1.61.
What I am trying to do is build a human activity recognition model using 10 classes of the UCF-101 dataset. My architecture uses GoogLeNet as the CNN followed by an LSTM, but I cannot figure out the issue.
My dataset now consists of 5 classes, and I use 60 videos for each class and 15 frames from each video. The training-to-testing ratio is 3:1.
Also, the transform that I use for the images is:

tranform_train = transforms.Compose([transforms.ToTensor(), transforms.RandomHorizontalFlip(p=0.7)])

I used to have a normalization transform as well, but I believe it caused some issues with the cross-entropy error. The only normalisation I do now is dividing the pixel values by 255.

At this point, something you should try would be to reduce your dataset to a really small subset of videos / frames and use that same data as both train and test. If you can’t (over)fit that data, something is not set up properly.

I think conventionally you do want to have an input normalization to match the GoogleNet parameters (like here) and I’m not aware of any issues with cross-entropy error, but regardless of the normalization you should be able to overfit a small data sample.
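
For reference, a sketch of the conventional ImageNet preprocessing for torchvision models (the mean/std values are the standard ImageNet statistics used by the pretrained weights):

transform_train = transforms.Compose([
    transforms.ToTensor(),                    # scales pixel values to [0, 1]
    transforms.RandomHorizontalFlip(p=0.7),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet channel means
                         std=[0.229, 0.224, 0.225]),  # ImageNet channel stds
])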

If you can overfit the small sample, but you still can’t get the model to work on the larger dataset, perhaps you just don’t have enough data / the right architecture / the right hyperparameters.