If I understand the comparison correctly, case 1 would be the “working case” with a high test accuracy and cases 2 and 3 are where the accuracy drops.
To compare the intermediate activations, you could use this code to register forward hooks for each layer. After you've passed the torch.ones tensor through the model, you could store the activations from both runs (e.g. case 1 and case 2) and compare them afterwards in another script.
This would narrow down where the difference in the outputs is coming from.
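A minimal sketch of the hook-based approach (the small `nn.Sequential` model here is just a placeholder for your own model):

```python
import torch
import torch.nn as nn

# Placeholder model; replace with your own.
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 2),
)

activations = {}

def get_hook(name):
    def hook(module, inp, out):
        # .detach() so the stored tensors don't keep the autograd graph alive
        activations[name] = out.detach().cpu()
    return hook

# Register a hook on every leaf module (skip containers like Sequential)
for name, module in model.named_modules():
    if len(list(module.children())) == 0:
        module.register_forward_hook(get_hook(name))

x = torch.ones(1, 4)
model(x)

# Persist for the offline comparison, e.g.:
# torch.save(activations, "case1_acts.pt")
for name, act in activations.items():
    print(name, act.shape)
```

In the separate comparison script you could then load both activation dicts and check each layer with e.g. `torch.allclose(act1[name], act2[name])` or `(act1[name] - act2[name]).abs().max()` to see at which layer the outputs start to diverge.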
Also, how large is the current difference in your outputs?