I am a newcomer to practical ML and PyTorch. I’m working on training an image discrimination model based mostly on textural features. My current model is very simple: 3 convolutional layers (plus ReLU/MaxPool) and 3 fully connected layers. My latest validation gives me an accuracy of 61% for 5 classes, which is not sufficient for my application.
Are there any guidelines for modifying my model to improve this performance? To someone as naive as I am, it appears there are almost infinite dimensions for experimentation. I could increase the number of filters in each layer (make the layers “broader”). I could increase the number of layers (“deeper”). I could try using dropout; I currently have about 1300 inputs per class, so I could well be having overfitting problems. Of course I could also experiment with changing the loss function (currently Cross-Entropy Loss) or the optimizer (Stochastic Gradient Descent), or other parameters.
Can anyone offer suggestions on a systematic way to evaluate these many alternatives (as opposed to trial and error)? Or are there well-known rules of thumb as to what sort of modifications are likely to be most effective?
I realize this is a very general question, compared to most in this forum. I hope it will generate some useful discussion.
Hi @Sally_Goldin, welcome to ML with PyTorch,
Before going through the suggestions below, please check whether your model is already overfitting. If it is, skip Step A.
Step A. Make your model overfit
In my experience, these things help:
- Try to make the model overfit the training data first (your model has not been tuned for this yet).
- Use a pretrained model (this can hugely improve both the performance and the training time). You could use one of the available pretrained models (thanks to the community).
- Try to avoid regularization as much as you can at first, i.e. dropout, L1/L2 penalties, or other fancier techniques from the internet.
- In my experience, it is best to avoid BatchNorm on the last layer.
- Double check your final activation layer (in your case, you should be using “softmax” cross-entropy, since your goal is image recognition/discrimination).
- Try adding BatchNorm2d after your convolutional layers (if you are using a custom model).
- Normalize your images if you haven’t already, e.g. to [0, 1] or [-1, 1].
(These 7 steps should solve most of your initial issue, which is to make your model perform very well on your training data.)
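As a rough sketch of the BatchNorm2d and normalization suggestions, assuming 64x64 RGB inputs stored as uint8 and made-up channel counts (the class name `TextureNet` and both helper functions are hypothetical, not taken from the original model):

```python
import torch
import torch.nn as nn


class TextureNet(nn.Module):
    """Hypothetical 3-conv / 3-linear classifier; all sizes are assumptions."""

    def __init__(self, num_classes=5):
        super().__init__()

        def block(c_in, c_out):
            # Conv -> BatchNorm2d -> ReLU -> MaxPool, as suggested above.
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            )

        self.features = nn.Sequential(block(3, 16), block(16, 32), block(32, 64))
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 128),  # 64x64 input halved three times -> 8x8
            nn.ReLU(inplace=True),
            nn.Linear(128, 64),
            nn.ReLU(inplace=True),
            nn.Linear(64, num_classes),  # raw logits: no BatchNorm, no softmax
        )

    def forward(self, x):
        return self.classifier(self.features(x))


def to_unit_range(images_uint8):
    """Scale uint8 pixel values to [0, 1]."""
    return images_uint8.float() / 255.0


def to_symmetric_range(images_uint8):
    """Scale uint8 pixel values to [-1, 1]."""
    return to_unit_range(images_uint8) * 2.0 - 1.0


# Random stand-in data in place of real images.
batch = torch.randint(0, 256, (4, 3, 64, 64), dtype=torch.uint8)
model = TextureNet()
logits = model(to_symmetric_range(batch))
print(logits.shape)
```

The last linear layer is left without BatchNorm or an activation, so the outputs can be passed straight to the loss function.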
Step B. Make your model generalize to unseen data
- This is where you should apply regularization (the techniques set aside in Step A). There are many resources on the internet that make a good starting point and will save you time.
- Double check your training data’s distribution and make sure each class has a similar number of images (imbalance is a common problem).
- If it still isn’t working, you may need to check your images and make sure they are recognizable (this came up often in my previous projects).
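A quick way to check the per-class distribution, sketched here with made-up labels (the `labels` tensor is a stand-in for the real training labels):

```python
import torch

# Made-up integer class labels, one per training image.
labels = torch.tensor([0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 2])
num_classes = 5

# Count how many images fall into each class.
counts = torch.bincount(labels, minlength=num_classes)
fractions = counts.float() / counts.sum()

for cls, (n, frac) in enumerate(zip(counts.tolist(), fractions.tolist())):
    print(f"class {cls}: {n} images ({frac:.1%})")

# Ratio of the largest class to the smallest; values near 1 mean balanced data.
imbalance_ratio = counts.max().item() / max(counts.min().item(), 1)
print(f"largest/smallest class ratio: {imbalance_ratio:.2f}")
```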
Hope it helps, cheers~
Maybe I wasn’t clear. The model classifies the training data more or less perfectly, but the randomly chosen 20% of images held out for validation aren’t classified very accurately.
There is no problem with balance of samples across the classes.
Also, the training is quite fast (a few seconds per epoch) because I am preloading the training images into the GPGPU.
I am trying some experiments with dropouts now. Definitely slows down the learning. Meanwhile, I welcome other suggestions. Thanks!
In this case your model is overfitting (which is a good start, as your model is at least able to perfectly predict the training set).
You would have to add more regularization or decrease the model capacity e.g. using
- data augmentation
- weight decay
- smaller model
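Weight decay, for example, is just an optimizer argument. A minimal sketch, assuming SGD as in the original post; the learning rate, the decay value of 1e-4, and the small stand-in model are illustrative assumptions, not tuned recommendations:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in model and data; replace with the real classifier and images.
model = nn.Linear(10, 5)
inputs = torch.randn(8, 10)
targets = torch.randint(0, 5, (8,))

criterion = nn.CrossEntropyLoss()
# weight_decay adds an L2 penalty on the parameters at every update step.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# One ordinary training step; nothing else in the loop has to change.
optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()
print(loss.item())
```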
If I remember your last topics correctly, you are using image data (and are preloading to save time).
Unfortunately, preloading onto the GPU won’t allow you to use torchvision.transforms for the data augmentation, but you could manually crop, flip, rotate, or change the brightness, saturation, etc. of the images.
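A minimal sketch of such manual augmentation using plain tensor operations, which work on whatever device the preloaded batch lives on. It assumes square float images in [0, 1] with shape (N, C, H, W); the function name `augment_batch` and the brightness range are hypothetical choices:

```python
import torch


def augment_batch(images):
    """Randomly flip, rotate and re-brighten a batch of square images.

    images: float tensor of shape (N, C, H, W) with values in [0, 1].
    Returns a new tensor; the input is left unchanged.
    """
    n = images.shape[0]
    out = images.clone()

    # Horizontal flip for a random subset of the batch.
    flip = torch.rand(n, device=images.device) < 0.5
    out[flip] = out[flip].flip(dims=[3])

    # Rotate the whole batch by a random multiple of 90 degrees.
    # (Square images assumed, otherwise H and W would swap.)
    k = int(torch.randint(0, 4, (1,)).item())
    out = torch.rot90(out, k, dims=(2, 3))

    # Per-image brightness factor in [0.8, 1.2], then clamp to valid range.
    scale = 0.8 + 0.4 * torch.rand(n, 1, 1, 1, device=images.device)
    return (out * scale).clamp(0.0, 1.0)


torch.manual_seed(0)
preloaded = torch.rand(16, 3, 32, 32)  # stand-in for the preloaded images
augmented = augment_batch(preloaded)
print(augmented.shape)
```

Calling this once per training iteration gives the model a slightly different view of the data each epoch, rather than a fixed enlarged dataset.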
How would a smaller model increase regularization? (My apologies if this seems like a silly or obvious question!) By smaller do you mean fewer layers? Fewer filters per layer?
It wouldn’t increase the regularization, but would decrease the model capacity.
Dropout can also be seen as a method to decrease the model capacity during training, as each training iteration will use a new and “smaller” model than the original one.
I would nevertheless try to increase the regularization first (dropout and data augmentation), before using a smaller model.
I’ve been experimenting with dropout over the past couple of hours. Based on tracking of minimum loss, it seems to have reduced the model’s ability to learn to near zero! BTW I really appreciate how helpful you and the other contributors are on this forum.
Do you mean that your training accuracy is now really bad after you’ve added these dropout layers?
If so, did you specify the drop rate or use the default of 0.5? Also, where did you add these layers?
Usually you could add dropout between linear layers (not as the last output layer).
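For illustration, a sketch of a classifier head with dropout between the linear layers; the layer sizes and the drop rate of 0.3 are assumptions, not values from the actual model:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

classifier = nn.Sequential(
    nn.Linear(4096, 128),
    nn.ReLU(),
    nn.Dropout(p=0.3),   # dropout between the first and second linear layer
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 5),    # output layer: no dropout after it
)

x = torch.randn(8, 4096)  # stand-in for flattened convolutional features

# In train mode a fresh random mask is drawn on every forward pass.
classifier.train()
train_a = classifier(x)
train_b = classifier(x)

# In eval mode dropout is disabled, so outputs are deterministic.
classifier.eval()
with torch.no_grad():
    eval_a = classifier(x)
    eval_b = classifier(x)

print(torch.equal(train_a, train_b), torch.equal(eval_a, eval_b))
```

Remember to call `model.eval()` before validating, otherwise dropout stays active and the validation accuracy is underestimated.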
With the dropout, the loss was barely declining, even with thousands of epochs. I used a dropout of 0.3, but between the second and third convolutional layers. I can go back and try a dropout between the linear layers instead.
Meanwhile, I’m currently doing augmentation, manually, to double the number of training samples.
I would stick to one of the pre-trained models such as resnet50 and concentrate on hyperparameter optimization: augmentation, weight decay, batch size, learning rate etc.
Thanks for your suggestion, but based on my admittedly limited understanding of transfer learning, I don’t think the pre-trained models are appropriate for this application. This is not object category recognition or anything similar.
I do not know if this will help but here is a good link to understand dropout better! https://machinelearningmastery.com/dropout-for-regularizing-deep-neural-networks/
Thank you. Excellent article.
I have now added a dropout between the first and second fully connected layers. As expected, the training is much slower, but I am approaching an accuracy level that is appropriate to the task.
Just a side note - the really terrible performance I was seeing earlier was not due to the dropout but to my misunderstanding of BTNug’s suggestion. I had also added a softmax, not realizing that my loss function, Cross Entropy, ALREADY does a softmax transformation. When I removed the extra softmax, the training behaved more reasonably.
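To illustrate the point, here is a small sketch with random stand-in logits showing that nn.CrossEntropyLoss matches log-softmax followed by the negative log-likelihood loss, and what happens when an extra softmax is applied first:

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.randn(6, 5) * 3        # stand-in model outputs for 5 classes
targets = torch.randint(0, 5, (6,))   # stand-in labels

criterion = nn.CrossEntropyLoss()

# Correct usage: pass raw logits.
loss_logits = criterion(logits, targets)

# Same value computed by hand: log-softmax followed by NLL loss.
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# Mistake: applying softmax before a loss that applies it again.
loss_double = criterion(F.softmax(logits, dim=1), targets)

# With inputs squeezed into [0, 1], even a "perfect" prediction (1 for the
# true class, 0 elsewhere) gives a loss of log(1 + 4/e) for 5 classes.
floor = math.log(1 + 4 / math.e)

print(loss_logits.item(), loss_manual.item())
print(loss_double.item(), floor)
```

Because the doubly-softmaxed loss can never fall below that floor, a training loss that barely declines is a plausible symptom of this mistake.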
Thanks for your suggestions.