PyTorch debug strange behavior

Version 1: `self.proj` is defined (and applied) inside `MainModule`:

```python
class MainModule(nn.Module):
    ...
    self.proj = Mlp()
    ...

output = self.main_module(output)
```

Version 2: `self.proj` is defined outside `MainModule` and applied after it:

```python
class MainModule(nn.Module):
    ...

output = self.main_module(output)
output = self.proj(output)
```

These two pieces of code should be functionally identical, yet they consistently produce very different results on every run. How can I debug this?

If I could somehow ensure that self.proj has the exact same weights in both versions (I don’t think setting the random seed works when the code itself differs between runs), that would at least let me check whether the forward activations and gradient updates differ. How do I do that?

I don’t fully understand the code.
In the first example self.proj seems to be a part of your MainModule class, while in the second you are calling two separate modules?

If both models have the same layers, I would suggest loading the state_dict from one model into the other, making sure all parameters are exactly the same. Then use a constant input tensor (or set the seed before creating a random one) and check the difference in the outputs.
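A minimal sketch of this check, using a small `nn.Sequential` stand-in since your actual `MainModule`/`Mlp` definitions aren’t shown:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two identical architectures, initialized independently
model_a = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
model_b = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

# Copy all parameters (and buffers) from model_a into model_b
model_b.load_state_dict(model_a.state_dict())

# Constant input tensor; alternatively, set the seed before torch.randn
x = torch.ones(2, 8)

out_a = model_a(x)
out_b = model_b(x)

# With identical weights and inputs, the outputs should match
# up to floating point precision
max_diff = (out_a - out_b).abs().max().item()
print(torch.allclose(out_a, out_b, atol=1e-6), max_diff)
```

If `max_diff` stays around 1e-6 or below, the two code paths are numerically equivalent; a larger gap points to a real difference in the models (e.g. a layer applied in one path but not the other).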
Note that an absolute error of ~1e-6 is most likely due to limited floating point precision.

Also, if one or both models contain e.g. nn.Dropout layers, you should disable them by calling model.eval() on both models first.
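To illustrate why this matters, here is a small sketch (again with a toy model, since yours isn’t shown) of how dropout makes repeated forward passes on the same input disagree until `eval()` is called:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5))
x = torch.ones(2, 8)

# In training mode, dropout samples a fresh random mask each call,
# so two forward passes on the same input almost certainly differ
model.train()
y1, y2 = model(x), model(x)
print(torch.equal(y1, y2))

# eval() disables dropout, making the forward pass deterministic
model.eval()
y3, y4 = model(x), model(x)
print(torch.equal(y3, y4))  # True
```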