Different predictions with eval() vs. requires_grad=False

Hello all,

I am getting different predictions in the following case:

m0 = torch.load("path_to_model")
m0.eval()  # switch to evaluation mode

m1 = torch.load("path_to_model")  # same path, i.e. the same checkpoint as m0
for param in m1.parameters():
    param.requires_grad = False  # only disable gradient tracking

prediction0 = m0(datum)
prediction1 = m1(datum)

prediction1 is consistently more accurate than prediction0.

  • I printed the sum of the weights for m0 and m1: the bias terms are exactly the same, but all the other weights differ. From datum to datum, the weights do not change for either model (see the comparison sketch after this list).
  • The only stochastic component is a dropout layer inside the model, and its probability was set to 0 during training.
  • The input datum is just raw tokenized text.
  • The model consists of nn.Linear, nn.Embedding and nn.LSTM layers.
  • This whole code segment runs inside another model during training; I wonder if that could be the reason.
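
A minimal sketch of that comparison (not my exact script):

import torch

m0 = torch.load("path_to_model")
m1 = torch.load("path_to_model")

# compare every parameter tensor by name instead of summing them
for (name0, t0), (name1, t1) in zip(m0.state_dict().items(), m1.state_dict().items()):
    assert name0 == name1
    if not torch.equal(t0, t1):
        print(name0, (t0 - t1).abs().max().item())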

Any pointers on this?

Do you get the same results if both models are in eval()?
Are you using any other layers, such as nn.BatchNorm?

Switching m1 to eval() before or after setting requires_grad=False makes no difference. I do not use nn.BatchNorm.

I cannot reproduce this issue with this dummy code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 3, 1, 1)
        self.conv2 = nn.Conv2d(6, 12, 3, 1, 1)
        self.drop = nn.Dropout(p=0.0)
        self.fc = nn.Linear(12*7*7, 10)
        
    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(x.size(0), -1)
        x = self.drop(x)
        x = self.fc(x)
        return x

model0 = MyModel()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model0.parameters(), lr=1e-2)

data = torch.randn(100, 3, 28, 28)
target = torch.empty(100, dtype=torch.long).random_(10)

# Train for some epochs
for epoch in range(10):
    optimizer.zero_grad()
    output = model0(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
    print('Epoch {}, loss {}'.format(epoch, loss.item()))


# Save state_dict
torch.save(model0.state_dict(), 'tmp.pt')

# Reinitialize both models
model0 = MyModel()
model0.load_state_dict(torch.load('tmp.pt'))
model1 = MyModel()
model1.load_state_dict(torch.load('tmp.pt'))

# Disable gradients for model1
for param in model1.parameters():
    param.requires_grad_(False)

model0.train()
output0_train = model0(data)
model0.eval()
output0_eval = model0(data)

model1.train()
output1_train = model1(data)
model1.eval()
output1_eval = model1(data)

print((output0_train == output0_eval).all())  # model0: train vs. eval output
print((output1_train == output1_eval).all())  # model1: train vs. eval output
print((output0_eval == output1_eval).all())   # model0 vs. model1, both in eval
print((output0_train == output1_eval).all())  # model0 in train vs. model1 in eval

Could you compare your code to mine and check for differences?
Also, if possible, could you try to create a reproducible code snippet so that we can debug it further?

Thanks for your help! The only difference is that I save/load the whole model instead of its state_dict. I'll make that change and see whether it makes a difference.
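
Concretely, the change I mean looks roughly like this (a sketch; MyLSTMModel stands in for my actual model class):

# what I currently do: pickle and load the whole module
torch.save(m0, "path_to_model")
m0 = torch.load("path_to_model")
m0.eval()

# what I will switch to: save/load only the state_dict
torch.save(m0.state_dict(), "path_to_model")
m0 = MyLSTMModel()  # hypothetical constructor for my actual model
m0.load_state_dict(torch.load("path_to_model"))
m0.eval()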

Following up on this topic in case anyone else has the same problem:

I trained a SyncNet model on custom data. Loading it for inference with only requires_grad=False produced unreliable results, halfway between convergence and random. Removing that setting and calling .eval() instead worked. This was confirmed with several different models across datasets, with different pre-processing, with fine-tuning on ~1K videos, etc.

# works: eval() switches the model to inference mode
device = torch.device("cuda" if use_cuda else "cpu")
syncnet = SyncNet().to(device)
syncnet.eval()

# does not work: requires_grad=False only disables gradient tracking,
# the model stays in training mode
device = torch.device("cuda" if use_cuda else "cpu")
syncnet = SyncNet().to(device)
for p in syncnet.parameters():
    p.requires_grad = False
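
To illustrate why the two settings are not interchangeable: requires_grad=False only tells autograd not to track gradients, while dropout (and batchnorm) check module.training, which only train()/eval() toggle. A minimal sketch, independent of SyncNet:

import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

# nn.Dropout has no parameters, so this loop changes nothing about its behaviour
for p in drop.parameters():
    p.requires_grad_(False)
print(drop.training)  # True  -> dropout is still active
print(drop(x))        # some entries zeroed, the rest scaled by 1 / (1 - p) = 2

drop.eval()
print(drop.training)  # False -> dropout is now a no-op
print(drop(x))        # identical to x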

Python 3.7.10 | packaged by conda-forge | [GCC 9.3.0] on linux
PyTorch 1.10.2+cu102