Unable to overfit on a very small training dataset

zimmer550 · November 2, 2019, 3:10am

Hello. I don’t know where else to ask this. Basically, I successfully extract features from a sequence of images, concatenate this feature vector with a vector of bounding box coordinates, and pass the resulting long vector through a two-layer fully connected network to predict the future bounding box coordinates.

I wanted to see whether I can overfit a small subset of my data, so I use this method on a sequence of just 150 images. I am unable to overfit it; in fact, the loss keeps increasing. I have tried a lot to deal with this. I normalize the input to my fc-layer using torch.nn.functional.normalize(). My fc-layer is a two-layer network which takes a vector of length 1028 as input, has one hidden layer of size 516, and outputs a vector of 4 values, which are added to the input to predict the future location of the bounding boxes.

Please tell me what kinds of problems I could be facing here. When something like this happens, does it mainly mean that there is no sensible relationship between the input and output, or could some hyperparameter tuning fix this? (I also tried increasing and decreasing the learning rate, using Adam without any weight decay.)

ptrblck · November 2, 2019, 8:30am

If your loss increases, could you check you’ve zeroed out the gradients via optimizer.zero_grad() or model.zero_grad()?
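As a minimal sketch of why this matters (the tiny `nn.Linear` model and the random tensors are hypothetical stand-ins, not the model from this thread): calling `backward()` twice without zeroing adds the second gradient onto the first, so each optimizer step would use a sum of stale gradients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model and data, only to illustrate gradient accumulation
model = nn.Linear(3, 1)
x = torch.randn(8, 3)
y = torch.randn(8, 1)
criterion = nn.MSELoss()

# First backward pass: .grad holds the gradient of one loss evaluation
criterion(model(x), y).backward()
grad_once = model.weight.grad.clone()

# Second backward pass WITHOUT zeroing: the new gradient is added to the old one
criterion(model(x), y).backward()
grad_twice = model.weight.grad.clone()

accumulated = torch.allclose(grad_twice, 2 * grad_once)
print(accumulated)

# The usual order inside every training iteration
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
optimizer.zero_grad()              # clear gradients left over from earlier steps
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```

If the gradients are only cleared once before training starts, every later step is affected in the same way.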

If you didn’t forget it, could you post some information regarding your training, e.g. which criterion are you using?
Note that nn.MSELoss was automatically broadcasting the target or output without a warning in older PyTorch versions. You should get a warning (since 1.1.0?) now.
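A small sketch of the broadcasting pitfall, using made-up tensors: an output of shape `(N, 1)` compared against a target of shape `(N,)` is broadcast to `(N, N)`, so the loss is averaged over all pairs instead of matching elements. The warning is suppressed here only to keep the output quiet; the behavior shown is what recent releases do as far as I know.

```python
import warnings

import torch
import torch.nn as nn

criterion = nn.MSELoss()

# Made-up example: the prediction equals the target, so the loss should be 0
output = torch.arange(5.0).unsqueeze(1)   # shape (5, 1)
target = torch.arange(5.0)                # shape (5,)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")       # hide the size-mismatch warning
    loss_broadcast = criterion(output, target)   # compares every pair (i, j)

# Matching the shapes explicitly gives the intended element-wise loss
loss_matched = criterion(output, target.unsqueeze(1))

print(loss_broadcast.item(), loss_matched.item())
```

Printing `output.shape` and `target.shape` right before the criterion call is a quick way to rule this out.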

If neither is the case, you might need to tune the hyperparameters further.

zimmer550 · November 2, 2019, 4:56pm

The loss oscillates around some value, but that value is usually very large. Some more context below:

So, I have a pre-trained RNN encoder-decoder architecture which extracts a feature vector from a sequence of images. I trained it using self-supervised learning (more details in this paper: https://arxiv.org/pdf/1909.04656.pdf) and I am confident it is extracting the correct features because its loss (after just 5 epochs) drops to 1.5 or something.

So, in order to predict the bounding box coordinates of an agent (I am basically trying to do what nvidia is working on here: https://blogs.nvidia.com/blog/2019/05/22/drive-labs-predicting-future-motion/), I extract features from the sequence of entire images, then I extract features from the sequence of the agent’s image and then I concatenate these two feature vectors with the most recent bounding box coordinates of the agent. This long vector has a length of 1028.

I put this one long vector through a fully connected network with a hidden size of 512 and an output size of 4. The outputs are not the raw future bounding box coordinates but delta values that are added to the input to get the future bounding box coordinates.
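For reference, the setup described above might be sketched as follows. The class name `BoxDeltaHead`, the ReLU activation, the split of the 1028 inputs into two 512-dimensional feature vectors plus 4 box coordinates, and the choice to add the deltas to the most recent box are all assumptions made for illustration; the actual code is not shown in this thread.

```python
import torch
import torch.nn as nn


class BoxDeltaHead(nn.Module):
    """Hypothetical two-layer head that predicts box deltas."""

    def __init__(self, in_dim=1028, hidden_dim=512, box_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),                      # activation is an assumption
            nn.Linear(hidden_dim, box_dim),
        )

    def forward(self, scene_feat, agent_feat, last_box):
        # Concatenate both feature vectors with the most recent box: 512 + 512 + 4
        fused = torch.cat([scene_feat, agent_feat, last_box], dim=1)
        delta = self.net(fused)
        # Residual prediction: future box = most recent box + predicted delta
        return last_box + delta


# Usage with random stand-in tensors for a batch of 2 agents
head = BoxDeltaHead()
scene_feat = torch.randn(2, 512)
agent_feat = torch.randn(2, 512)
last_box = torch.rand(2, 4)
future_box = head(scene_feat, agent_feat, last_box)
print(future_box.shape)
```

With a residual formulation like this, the target and the box that the deltas are added to need to be in the same units (both raw or both normalized the same way).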

In every attempt, I noticed that my outputs would start with very small values (0.00012, etc.) and then, within just 5 epochs, they would grow to huge values around 9.23 or -23.41 for every agent, and then they would just linger there and not change much at all.

I have double-checked everything. I normalize the two feature vectors and the most recent bounding box coordinates separately before concatenating them (using torch.nn.functional.normalize(), and I double-check that they are normalized by summing all elements along a row for each vector). I have tried increasing and decreasing the learning rate (with Adam, and I also tried adding some weight decay, but there was no change).
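One detail about that check, shown as a small sketch with made-up numbers: with its default arguments (`p=2`, `dim=1`), `torch.nn.functional.normalize()` scales each row to unit L2 norm, so the plain sum of a row's elements is generally not 1. Passing `p=1` instead makes the absolute values of each row sum to 1.

```python
import torch
import torch.nn.functional as F

# Made-up rows, chosen so the numbers are easy to follow
x = torch.tensor([[3.0, 4.0],
                  [1.0, 1.0]])

x_l2 = F.normalize(x)                 # defaults: p=2, dim=1
row_sums = x_l2.sum(dim=1)            # not 1 in general
row_norms = x_l2.norm(p=2, dim=1)     # 1 for every row

x_l1 = F.normalize(x, p=1, dim=1)     # L1 normalization instead
l1_sums = x_l1.abs().sum(dim=1)       # 1 for every row

print(row_sums, row_norms, l1_sums)
```

Note also that scaling the 4 bounding box coordinates to unit norm discards their overall magnitude, which may matter if the predicted deltas are later added to the unnormalized box.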

The only optimizer here is the one for the fully connected layer. I zero its gradients before the main loop, in which I add up the values of all the loss functions, and then I do backprop outside the loop. (I only call fc_optimizer.zero_grad() and not fc_model.zero_grad(). Is there any difference between the two?)
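On the zero_grad() question, a minimal sketch with a hypothetical toy model: when the optimizer was constructed from all of the model's parameters, `fc_optimizer.zero_grad()` and `fc_model.zero_grad()` clear the same gradients. They differ only when the optimizer holds a subset of the parameters, or parameters from several models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def has_grad(model):
    # True if any parameter currently holds a non-zero gradient
    return any(
        p.grad is not None and bool(p.grad.abs().sum() > 0)
        for p in model.parameters()
    )


torch.manual_seed(0)

# Hypothetical toy stand-in for the fully connected model
fc_model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))
fc_optimizer = torch.optim.Adam(fc_model.parameters(), lr=1e-3)
x, y = torch.randn(5, 8), torch.randn(5, 2)

F.mse_loss(fc_model(x), y).backward()
before = has_grad(fc_model)             # gradients are populated

fc_optimizer.zero_grad()
after_optimizer = has_grad(fc_model)    # cleared through the optimizer

F.mse_loss(fc_model(x), y).backward()
fc_model.zero_grad()
after_model = has_grad(fc_model)        # cleared through the module

print(before, after_optimizer, after_model)
```

Whichever call is used, it has to run once per optimizer step. If it runs only once before the outermost (epoch) loop, gradients from earlier backward passes keep accumulating.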

Honestly, right now I am just trying a completely different approach since if my model can’t even overfit 150 images then the model simply can’t capture the relationship between the current input and desired output.