Non-reproducible results during evaluation


I have trained a bidirectional GRU, and during evaluation (when I load the weights from the checkpoint) I get different results on every run, even though the input is exactly the same and I run on CPU. Dropout is removed and eval mode is set. I have checked everything I can think of and no longer know where to look.

One silly reason (which has happened to me) could be the settings or transforms in your loader. If the loader has shuffle=True, or a random transform such as RandomHorizontalFlip(), then the input to the network changes every time you run your code, so naturally the output changes slightly every time.
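A quick way to rule this out is to iterate over the loader twice and compare the batches. This is only a sketch: the toy dataset and the `collect` helper are made up for illustration and stand in for your real evaluation data.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the real evaluation data (an assumption).
data = torch.arange(10, dtype=torch.float32).unsqueeze(1)
loader = DataLoader(TensorDataset(data), batch_size=4, shuffle=False)


def collect(loader):
    """Return the first tensor of every batch, in iteration order (hypothetical helper)."""
    return [batch[0].clone() for batch in loader]


first_pass = collect(loader)
second_pass = collect(loader)

# With shuffle=False and no random transforms, both passes should match exactly.
same_order = len(first_pass) == len(second_pass) and all(
    torch.equal(a, b) for a, b in zip(first_pass, second_pass)
)
print(same_order)
```

If the same check fails on your own loader, the variation is coming from the data pipeline rather than from the GRU.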

Thanks, but the input is not processed in any way. I get different results running the GRU over the same input each time I run the script.

Not sure if this’ll help at all, but what happens when you manually set the RNG seed to something like 0?
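Something along these lines is what I have in mind. It is a minimal sketch, assuming the script uses Python's `random`, NumPy, and PyTorch; the helper name `seed_everything` is hypothetical, not a library function.

```python
import random

import numpy as np
import torch


def seed_everything(seed: int = 0) -> None:
    """Seed the Python, NumPy, and PyTorch RNGs (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


# Re-seeding with the same value should reproduce the same random draws.
seed_everything(0)
a = torch.rand(3)
seed_everything(0)
b = torch.rand(3)
print(torch.equal(a, b))
```

Call it once at the very top of the script, before the model or any data objects are created.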

This should not matter for evaluation, right? Especially on CPU, as there is no randomness (no dropout, no cuDNN).
I did try fixing the NumPy and torch random seeds, though, and there is still variation in the GRU forward pass.

By "not processed", do you mean you have no transforms at all?

Can you post a minimal working example? Using a dict at some point in your code may also cause this kind of problem.
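As a starting point, here is a rough sketch of the kind of example that would help. Everything in it is an assumption on my part: the layer sizes are arbitrary, the randomly initialised GRU stands in for your trained model, and an in-memory buffer stands in for your checkpoint file.

```python
import io

import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the trained model; the sizes are arbitrary.
gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True, bidirectional=True)

# Round-trip the weights through an in-memory "checkpoint".
buffer = io.BytesIO()
torch.save(gru.state_dict(), buffer)
buffer.seek(0)

restored = nn.GRU(input_size=8, hidden_size=16, batch_first=True, bidirectional=True)
restored.load_state_dict(torch.load(buffer))
restored.eval()

x = torch.randn(2, 5, 8)  # (batch, time, features)
with torch.no_grad():
    out1, h1 = restored(x)
    out2, h2 = restored(x)

# Compare two forward passes over the same input.
print(torch.allclose(out1, out2), torch.allclose(h1, h2))
```

If a stripped-down script like this is stable but your full script is not, the difference probably comes from somewhere outside the GRU. One candidate is a vocabulary or index mapping built by iterating over a set of strings, or over a dict on an older Python version, since that order can change from one run to the next.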