Transformer Encoder changes behavior with @torch.no_grad()


I am trying to investigate some irregular behavior in my trained network. The encoder is a standard PyTorch Transformer encoder. I have found that adding the @torch.no_grad() decorator to a function changes the encoder outputs. For example:

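# Runs the encoder with autograd enabled.
def forward_1(self, x):
    return self.encoder(x)

# Same call, but with gradient tracking disabled via the decorator.
@torch.no_grad()
def forward_2(self, x):
    return self.encoder(x)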

If I set the model to eval() and run both functions above with an identical input, I get two different outputs.

If I am not mistaken, with model.eval() all sources of variation such as dropout should be disabled, and gradient computation itself should not affect the outputs in any way. In other words, shouldn’t the two functions above return the exact same Tensor?

Many thanks in advance.

This is expected: the numerics can differ between code paths. Depending on whether the inputs require grad, the encoder may dispatch to different SDPA (scaled dot-product attention) backends, so the results need not match bit for bit.
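To check whether the difference is only floating-point noise, one option is to compare the two outputs with a tolerance instead of testing for exact equality. The snippet below is a minimal sketch, not your actual model: the layer sizes, the input shape, and the tolerance values are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical small encoder; sizes are illustrative only.
layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
encoder.eval()  # disables dropout

x = torch.randn(2, 5, 32)  # (batch, sequence, features)

# Path 1: autograd enabled (the parameters require grad).
out_grad = encoder(x)

# Path 2: autograd disabled, as with the @torch.no_grad() decorator.
with torch.no_grad():
    out_nograd = encoder(x)

# Largest element-wise gap between the two paths.
max_diff = (out_grad.detach() - out_nograd).abs().max().item()
print(f"max abs difference: {max_diff:.3e}")

# Compare with a tolerance rather than exact equality (assumed tolerances).
close = torch.allclose(out_grad.detach(), out_nograd, rtol=1e-4, atol=1e-4)
print("outputs match within tolerance:", close)
```

If the outputs agree within a small tolerance, the mismatch most likely comes from different kernels being used, not from a bug in the model. A much larger gap would point to something else, such as the model not actually being in eval mode.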