I’m running into a problem with my U-Net. I’m trying to train my model to capture people, but right now I’ve hit a dead end: I can’t find any way to improve the training and validation results, and I’m somewhat lost.
To summarize my model, data, and anything else I can think of: my model is a standard U-Net that uses Tanh and Group Normalization in each encoder and decoder block. My data comes from the COCO dataset. I add around 100–200 pixels of padding to either the height or the width, depending on the image shape, to bring each image to roughly 636 x 636. I then normalize the input values to [-1, 1], and de-normalize the outputs back to [0, 1] for the BCELoss() function. I have about 1613 images for training and validation combined, and thinking about it now as I type this, could the problem be a lack of data? I’m not sure what the problem is, but any insight or tips would help!
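Roughly, the preprocessing I described looks like this (this is just a sketch of what I do, not my exact code, and the helper name is made up):

```python
import torch.nn.functional as F

def pad_and_normalize(img, target=636):
    # img: float tensor (C, H, W) with values in [0, 1]
    _, h, w = img.shape
    pad_h = max(target - h, 0)
    pad_w = max(target - w, 0)
    # split the padding between the two sides of each dimension
    img = F.pad(img, (pad_w // 2, pad_w - pad_w // 2,
                      pad_h // 2, pad_h - pad_h // 2))
    # rescale [0, 1] -> [-1, 1] for the network input
    return img * 2.0 - 1.0
```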
I would not denormalize. I would also suggest that you use BCEWithLogitsLoss, rather than BCELoss, as your loss criterion. (It’s numerically more stable.)
What are your target values? For BCEWithLogitsLoss (as well as BCELoss) you would
typically want them to be 0.0 or 1.0.
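As a concrete sketch of that wiring (the tensors here are just dummies with made-up shapes), you would feed the raw, un-sigmoided output of your final convolution straight into the loss, with targets that are 0.0 / 1.0 floats:

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()

# dummy shapes just to show the wiring: raw (un-sigmoided) logits and 0.0/1.0 float targets
logits  = torch.randn(2, 1, 64, 64, requires_grad=True)    # output of the last conv, no sigmoid
targets = torch.randint(0, 2, (2, 1, 64, 64)).float()      # 1.0 = person pixel, 0.0 = background

loss = criterion(logits, targets)
loss.backward()
```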
This seems smallish, but it could be (or might not be) enough. In any event, you should be
able to overfit in that you get a low loss value (and a good figure of merit) on your training
set, even if your validation-set results start to get worse as you keep training.
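One way to run that overfitting sanity check is to train repeatedly on a single fixed batch and confirm the loss drives toward zero; something like the following, where model and train_loader stand in for your own U-Net and DataLoader:

```python
import torch
import torch.nn as nn

# model and train_loader are your own network and DataLoader
images, masks = next(iter(train_loader))     # one small, fixed batch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()

model.train()
for step in range(500):
    optimizer.zero_grad()
    loss = criterion(model(images), masks.float())
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        # the loss should drive toward zero if the model / loss wiring is right
        print(f"step {step}: loss {loss.item():.4f}")
```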
Am I right that by “capture people” you mean that you want to label pixels that are part of
a person in the original image as “foreground” (e.g., 1) and pixels that are not part of a
person as “background” (e.g., 0)?
If you have many more background pixels than foreground pixels (or vice versa), you should
probably make use of BCEWithLogitsLoss’s pos_weight constructor argument.
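For example, you could estimate pos_weight as the ratio of background pixels to person pixels in your training masks; a sketch (train_dataset stands in for your own dataset, and the masks are assumed to be 0 / 1 tensors):

```python
import torch

# count person (foreground) and background pixels over the training masks
pos, neg = 0.0, 0.0
for _, mask in train_dataset:            # your own dataset; mask assumed to be a 0/1 tensor
    fg = mask.sum().item()
    pos += fg
    neg += mask.numel() - fg

# pos_weight > 1 up-weights the rarer foreground class
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor([neg / pos]))
```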
There are some nuances to U-Net, but try without the denormalization and try overfitting
before getting into the weeds.
I used BCEWithLogitsLoss before and will try re-using it. Also, why is de-normalizing the data not a good idea here? And are there other cases where it is acceptable?
I increased my data from 1613 images to around 5000: 3000 original images and 2000 flipped copies.
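The flips are just horizontal mirrors of existing images and their masks, roughly like this:

```python
import torch

def hflip_pair(img, mask):
    # mirror the image and its mask together along the width dimension
    return torch.flip(img, dims=[-1]), torch.flip(mask, dims=[-1])
```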
My target values are 0 and 1, and yes, highlighted people are the foreground and everything else is the background.
Moreover, I added an extra normalization layer to each block, so now each block has two (sketched below). Other than that, I haven’t run my program yet, but I’ll report back here with an update soon.
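Assuming the same GroupNorm + Tanh pattern from my original setup, each block now looks something like this (channel counts and the group count are just examples):

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, groups=8):
    # two convolutions, each followed by GroupNorm and Tanh
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.GroupNorm(groups, out_ch),
        nn.Tanh(),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.GroupNorm(groups, out_ch),
        nn.Tanh(),
    )
```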
After making these changes, as well as changing the resolution to 556 x 556 and running 25 epochs, I can report a big improvement: from about 0.50 train and validation loss in the first epoch to around 0.35 for both train and validation by epoch 20, with the validation loss showing a bit more spread than the train loss, at roughly +0.1 to +0.2. Next I’m going to try reducing the padding and running my program on a bigger computer.