I am using PyTorch Geometric to predict specific chemical properties (a regression task) with the GraphGPS repo (GitHub - rampasek/GraphGPS: Recipe for a General, Powerful, Scalable Graph Transformer).
As a sanity check, we ran an overfitting test with a single molecule (the same sample for training and validation). We found that when Layer Normalization is active, at a certain point during training the validation loss stops decreasing while the training loss keeps decreasing. If we deactivate Layer Norm, this does not happen. Why does the loss differ between validation and training if it is the same sample?
As far as I know, Layer Norm behaves the same during training and validation. It is not like Batch Norm, which at evaluation time uses the running mean and variance accumulated from the training samples.
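To make this concrete, here is a minimal sketch in plain PyTorch (a toy tensor, not the GraphGPS model; the shapes, the seed, and the helper `train_eval_outputs` are arbitrary choices for illustration). It checks that `torch.nn.LayerNorm` gives the same output in `train()` and `eval()` mode, while `torch.nn.BatchNorm1d` does not:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 8)  # toy batch: 4 samples, 8 features (arbitrary sizes)

layer_norm = nn.LayerNorm(8)
batch_norm = nn.BatchNorm1d(8)


def train_eval_outputs(module, inputs):
    """Hypothetical helper: run the module in train mode, then in eval mode."""
    module.train()
    out_train = module(inputs)
    module.eval()
    with torch.no_grad():
        out_eval = module(inputs)
    return out_train.detach(), out_eval


ln_train, ln_eval = train_eval_outputs(layer_norm, x)
bn_train, bn_eval = train_eval_outputs(batch_norm, x)

# LayerNorm normalizes each sample over its own features,
# so the train/eval mode does not change the result.
ln_same = torch.allclose(ln_train, ln_eval)

# BatchNorm1d uses batch statistics in train mode
# and running statistics in eval mode.
bn_same = torch.allclose(bn_train, bn_eval)

print("LayerNorm identical in train/eval:", ln_same)
print("BatchNorm1d identical in train/eval:", bn_same)
```

So, at least in isolation, Layer Norm by itself should not produce a gap between training and validation on the same sample.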
We also want to understand why the loss does not go to zero with only one sample, no matter the learning rate or the number of epochs. Is this common in regression tasks? Should some residual error be expected? (Yes, we know the error is low, but this is an overfitting test with only one sample.) Is it because of the model's capacity, or is it a bug in the code? Why can't the loss go lower?
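For reference, the sanity check we run looks roughly like the sketch below. This is a simplified version with a toy MLP standing in for the GraphGPS model: the model, the feature size, the target value, the learning rate, and the number of steps are all placeholders, not our actual config. It evaluates the same sample in `eval()` mode right after each update, so the two losses can be compared directly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder "molecule": one feature vector and one scalar target (made-up values).
x = torch.randn(1, 16)
y = torch.tensor([[1.5]])

# Toy stand-in for the real model; it contains LayerNorm but no dropout.
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.LayerNorm(32),
    nn.ReLU(),
    nn.Linear(32, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

num_steps = 200
train_losses, val_losses = [], []
for step in range(num_steps):
    # Training step on the single sample (loss is measured before the update).
    model.train()
    optimizer.zero_grad()
    train_loss = loss_fn(model(x), y)
    train_loss.backward()
    optimizer.step()
    train_losses.append(train_loss.item())

    # "Validation" on the very same sample, measured after the update.
    model.eval()
    with torch.no_grad():
        val_losses.append(loss_fn(model(x), y).item())

print(f"first step: train={train_losses[0]:.3e}  val={val_losses[0]:.3e}")
print(f"last step:  train={train_losses[-1]:.3e}  val={val_losses[-1]:.3e}")
```

In this toy setup the only difference we would expect between the two numbers is the one-update lag (the training loss is computed before the optimizer step, the validation loss after it), which is why the gap we see in the real model puzzles us.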