Before getting to loss, there are a few things that could be improved:
- If you are performing a classification task (predicting label probabilities), I would use cross-entropy loss. In pytorch, the built-in `nn.CrossEntropyLoss` module takes the raw output of the model, before any softmax is applied. So if you use the pytorch module (I think you should), you shouldn’t run the final layer through softmax or sigmoid. To be honest, I can’t think of a good reason to apply sigmoid and then softmax on any layer in the first place. Sigmoid squeezes a single raw value to be between 0 and 1. Softmax squeezes a set of raw values to each be between 0 and 1 while also adding up to 1. When you use both together, you still get outputs between 0 and 1 that add up to 1, but you have unknowingly added other non-useful constraints. It also looks like you implemented a custom MSE loss within the training loop. There is a built-in `nn.MSELoss` module, which I recommend for regression problems; but again, this looks like a classification problem, so I would use `nn.CrossEntropyLoss`.
- You should use the pytorch `nn.Linear` module for linear layers in the model. You can find many examples of `nn.Linear` being used in neural networks, and it automatically initializes both weights and biases.
- It looks like you are performing stochastic gradient descent (gradient descent steps taken on part of the training data instead of on all of it at once). This is a good approach when your dataset is too large to process at once, but right now you are taking a gradient step on a single data point at a time. In your training loop, for each epoch you should randomly split the data into a set of batches and take one gradient descent step per batch instead of per data point (see the sketch after this list). I would read up on epochs, batches, and stochastic gradient descent.
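
To make the first three points concrete, here is a minimal sketch. The dimensions, dummy data, and batch size are made up (I don’t know your actual architecture or dataset), and I’ve used a plain `torch.optim.SGD` optimizer as a stand-in for however you are currently updating weights:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical dimensions: flattened 28x28 images, 10 classes.
n_features, n_classes = 28 * 28, 10

# Dummy data standing in for your real X (float features) and y (integer class labels).
X = torch.randn(1000, n_features)
y = torch.randint(0, n_classes, (1000,))

# nn.Linear initializes weights and biases for you. No softmax/sigmoid on the output,
# because nn.CrossEntropyLoss expects raw logits.
net = nn.Sequential(
    nn.Linear(n_features, 128),
    nn.ReLU(),
    nn.Linear(128, n_classes),
)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3)

# DataLoader shuffles the data each epoch and hands you mini-batches,
# so each gradient step uses a whole batch instead of a single point.
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

for epoch in range(10):
    for X_batch, y_batch in loader:
        optimizer.zero_grad()
        logits = net(X_batch)              # net(X), not net.forward(X)
        loss = criterion(logits, y_batch)  # y_batch is integer classes, not one-hot
        loss.backward()
        optimizer.step()
```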
Other miscellaneous notes:
- When saving the total loss, it might make more sense to store the loss of each step in a list: define `total_loss = []` before training and call `total_loss.append(loss.item())` at each step. This way you can see how the loss changes from step to step; it should go down. If you just sum the loss over time, the total can only increase. (The sketch after this list shows one way to track this.)
- A learning rate of `.05` might be pretty high, depending on the application. You could try lowering it to maybe `1e-3`.
- You should check out the `torch.flatten` function for converting a 2D image tensor into a 1D vector tensor.
- It is also helpful to track accuracy alongside loss for classification problems.
- You do not need to call `net.forward(X)`. When using a pytorch module, you can simply call `net(X)` to get a forward pass.
- You don’t need one-hot encoding for cross-entropy loss. `nn.CrossEntropyLoss` takes a matrix of raw predicted values (which it converts to probabilities internally) and a vector of true classes (as integers).
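
Putting the miscellaneous notes together, here is a possible shape for the tracking code, continuing the hypothetical `net`, `criterion`, `optimizer`, and `loader` from the sketch above:

```python
# Continuing the hypothetical net / criterion / optimizer / loader from the sketch above.
total_loss = []           # one entry per step, so you can plot it and check it goes down
accuracy_per_epoch = []   # accuracy is worth tracking alongside loss for classification

for epoch in range(10):
    correct, seen = 0, 0
    for X_batch, y_batch in loader:
        # torch.flatten collapses everything after the batch dimension. It is a no-op
        # here (X is already flat), but needed if your X is image-shaped, e.g. (batch, 28, 28).
        X_batch = torch.flatten(X_batch, start_dim=1)

        optimizer.zero_grad()
        logits = net(X_batch)              # net(X_batch), not net.forward(X_batch)
        loss = criterion(logits, y_batch)  # integer class labels, no one-hot needed
        loss.backward()
        optimizer.step()

        total_loss.append(loss.item())
        correct += (logits.argmax(dim=1) == y_batch).sum().item()
        seen += y_batch.size(0)

    accuracy_per_epoch.append(correct / seen)
```

Plotting `total_loss` and `accuracy_per_epoch` should make it obvious whether training is actually progressing.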
Making these changes alone could resolve the loss issue.