Random State, Manual Seed & Oscillating Loss

Hello,

I have a simple model with three GraphConv layers (DGL) and one nn.Linear layer, trying to do predictions over a graph with 700 vertices (560 training vertices in a semi-supervised GCN setting). About 10 days ago, when I first wrote it, I was consistently getting training/test accuracies of >= 95%. Then something changed, and I can't figure out what (it is such a simple model): the loss now either goes down painstakingly slowly or starts oscillating, and the accuracy has dropped below 40%. But even this isn't consistent: once every 10-15 runs, the accuracy shoots up to > 85% and then falls back down. All of this is on the exact same training set (same training vertices).

This made me suspect that something was up with the random state, so I set torch.manual_seed(0) at the start of the program. Now the accuracy is back at 95% and, as expected with a fixed RNG seed, consistent across runs. So my hunch about this being RNG-related was right, but the fix feels like a band-aid on the symptom rather than a real root-cause analysis. Why I am starting from such a terrible state is simply beyond me. I also highly doubt that my simple use case has uncovered an underlying PyTorch/DGL bug, so something must be up with my model.
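For context, this is roughly the seeding I would use if I wanted to pin down every RNG, not just PyTorch's. Only the torch.manual_seed(0) line is what I actually have; the random/numpy/dgl lines are my guesses at what else might matter:

import random

import numpy as np
import torch
import dgl

def seed_everything(seed=0):
    # Seed every RNG that could affect weight init or data order.
    random.seed(seed)        # Python's built-in RNG
    np.random.seed(seed)     # NumPy, used by many data utilities
    torch.manual_seed(seed)  # PyTorch CPU (and, by default, CUDA) RNGs
    dgl.seed(seed)           # DGL's own RNG

seed_everything(0)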

Does anyone have an obvious hunch, even without looking at the actual code? If you need to see the model itself, I can post it here, but really it is as simple as:

nn.ReLU()(dgl.nn.pytorch.GraphConv(10, 10)) x 3
nn.Sigmoid()(nn.Linear(10, 4))
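(For the record, a runnable version of that shorthand would look roughly like this; I'm reconstructing it from memory, so the names are illustrative:)

import torch
import torch.nn as nn
from dgl.nn.pytorch import GraphConv

class SimpleGCN(nn.Module):
    def __init__(self, in_feats=10, hidden=10, n_classes=4):
        super().__init__()
        self.conv1 = GraphConv(in_feats, hidden)
        self.conv2 = GraphConv(hidden, hidden)
        self.conv3 = GraphConv(hidden, hidden)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, g, feat):
        # Each GraphConv takes the graph plus the node features.
        h = torch.relu(self.conv1(g, feat))
        h = torch.relu(self.conv2(g, h))
        h = torch.relu(self.conv3(g, h))
        return torch.sigmoid(self.fc(h))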

Also, this is possibly a DGL forum question, but since torch.manual_seed(0) seemed to alleviate the symptoms, I thought I might have better luck starting with a broader audience.

I have also cross-posted it in the DGL forum: Random State, Manual Seed & Oscillating Loss - Questions - Deep Graph Library

The dependency on the random seed could point towards e.g. a bad initialization of the model, and I agree that this feels more like a band-aid than a proper solution.

It's hard to isolate the real root cause, since you don't have the initial code anymore and thus cannot compare different runs. However, I would check the initialization as well as the shuffling of the dataset etc., which all depend on the seed; two quick checks are sketched below.
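E.g. you could print the weight statistics right after construction for a "good" and a "bad" seed, or overwrite the default initialization and see whether the instability goes away. A rough sketch, where model stands for your GCN and Xavier init is just one reasonable choice:

import torch.nn as nn

def inspect_init(model):
    # Compare these numbers across a few seeds, right after construction.
    for name, param in model.named_parameters():
        print(f"{name}: mean={param.mean().item():.4f}, std={param.std().item():.4f}")

def reinit_xavier(model):
    # Overwrite the default init: Xavier-uniform weights, zero biases.
    for param in model.parameters():
        if param.dim() >= 2:   # weight matrices (Linear, GraphConv)
            nn.init.xavier_uniform_(param)
        else:                  # biases and other 1-dim parameters
            nn.init.zeros_(param)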