Model.eval() gives different results every time

My model looks like this:

import torch
import torch.nn as nn
import torch.nn.functional as F


class GRU(nn.Module):

    def __init__(self, vocab_size, emb_size, hidden_size, num_classes):
        super(GRU, self).__init__()

        self.encoder = nn.Embedding(vocab_size, emb_size)
        self.drop = nn.Dropout(p=0.8)
        self.gru = nn.GRU(emb_size, hidden_size, dropout=0.8, batch_first=True)
        self.bn = nn.BatchNorm1d(hidden_size)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        x = self.encoder(x)                   # (batch, seq_len, emb_size)
        x = self.drop(x)
        x, _ = self.gru(x)                    # (batch, seq_len, hidden_size)
        scores = x.matmul(x.transpose(1, 2))  # self-attention scores
        scores = F.softmax(scores, dim=1)
        x = scores.matmul(x).sum(1)           # weighted sum over the sequence
        x = self.bn(x)
        x = self.fc(x)
        return x

I have loaded the model like this:

model = GRU(vocab_size, embedded_size, hidden_size, num_classes)
model.load_state_dict(torch.load(PATH))
model = model.to(device)
model.eval()

And here is my accuracy test function:

def compute_accuracy(model, data_loader):
    correct_pred, num_examples = 0, 0
    for i, (features, targets) in enumerate(data_loader):

        features = features.to(device)
        targets = targets.to(device)

        logits = model.eval()(features)
        y_pred = logits.max(1)[1]

        num_examples += targets.size(0)
        correct_pred += (y_pred == targets).sum()
    return correct_pred.float() / num_examples * 100

print(compute_accuracy(model.eval(), valid_loader))

But every time I test the model on my dataset, the results are different.
Am I doing something wrong, or is .eval() mode not reliable?

This shouldn’t be the case, and I can’t see anything obviously wrong.
One side note: could you call model.eval() once at the beginning of your compute_accuracy method and just pass the features in as logits = model(features)?

Also, if it’s possible, could you share the state_dict as well as the values for the hyperparameters of your model?
I just tried random values and got the same output after setting model.eval().

So my hyperparams are:

vocab_size = 33988
embedded_size = 500
hidden_size = 300
num_classes = 363

I modified my compute_accuracy, but the results are still different each time.

def compute_accuracy(model, data_loader, train=False, validation=False):
    correct_pred, num_examples = 0, 0
    model.eval()
    for i, (features, targets) in enumerate(data_loader):

        features = features.to(device)
        targets = targets.to(device)

        logits = model(features)
        y_pred = logits.max(1)[1]

        num_examples += targets.size(0)
        correct_pred += (y_pred == targets).sum()
    return correct_pred.float() / num_examples * 100

The state_dict of the model is too big to post here.

Thanks for the values!
I couldn’t reproduce this issue using some random inputs:

import torch
from torch.utils.data import TensorDataset, DataLoader

x = torch.empty(1000, 10, dtype=torch.long).random_(vocab_size)
y = torch.empty(1000, dtype=torch.long).random_(num_classes)
dataset = TensorDataset(x, y)
loader = DataLoader(
    dataset,
    batch_size=1,
    num_workers=1,
    shuffle=False
)

model.eval()

correct1 = 0.0
nb_samples = 0
with torch.no_grad():
    for data, target in loader:
        output = model(data)
        pred = output.argmax(1)
        correct1 += (pred==target).float().sum()
        nb_samples += data.size(0)
acc1 = correct1 / nb_samples

correct2 = 0.0
nb_samples = 0
with torch.no_grad():
    for data, target in loader:
        output = model(data)
        pred = output.argmax(1)
        correct2 += (pred==target).float().sum()
        nb_samples += data.size(0)
acc2 = correct2 / nb_samples

print(correct1, acc1)
print(correct2, acc2)

I noticed one possible issue: could you cast your comparison to float before summing it, like this:

correct_pred += (y_pred == targets).float().sum()

Depending on the batch size you are using, you might encounter an overflow here.
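
As a quick illustration (a standalone sketch, not code from this thread): in older PyTorch versions the comparison (y_pred == targets) returned a uint8 tensor, and uint8 arithmetic wraps around modulo 256, so an accumulated count can silently overflow.

import torch

# uint8 values wrap around at 256, so accumulating correct-prediction counts
# in uint8 can silently lose information.
a = torch.tensor(200, dtype=torch.uint8)
b = torch.tensor(100, dtype=torch.uint8)

print(a + b)                  # tensor(44, dtype=torch.uint8) -- wrapped around
print(a.float() + b.float())  # tensor(300.) -- casting to float avoids the wrap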

Here is the updated function with the float cast:

def compute_accuracy(model, data_loader, train=False, validation=False):
    correct_pred, num_examples = 0, 0
    model.eval()
    for i, (features, targets) in enumerate(data_loader):

        features = features.to(device)
        targets = targets.to(device)

        logits = model(features)
        y_pred = logits.max(1)[1]

        num_examples += targets.size(0)
        correct_pred += (y_pred == targets).float().sum()
    return correct_pred.float() / num_examples * 100

The results are:

First run:

tensor(81.4359, device='cuda:0')
tensor(75.7375, device='cuda:0')

Second run:

tensor(81.4284, device='cuda:0')
tensor(75.7506, device='cuda:0')

Are the two numbers you are printing correct_pred and (correct_pred / num_examples * 100)?

correct_pred.float() / num_examples * 100

Yes

Why is correct_pred not an integer (not the dtype but the value), if you just sum (y_pred == targets)?

return correct_pred / num_examples * 100

I tried it, but the results are still different:

tensor(81.4359, device='cuda:0')
tensor(75.7331, device='cuda:0')

and

tensor(81.4478, device='cuda:0')
tensor(75.7375, device='cuda:0')

For debugging purposes, maybe also try

(y_pred.long() == targets.long()).sum()

(but I guess if they were not longs before you would get an error).

Another thing: I am not exactly sure what your data loader does. Usually, shuffling wouldn’t affect the accuracy, because you iterate through all examples anyway and don’t update the weights. However, since you have sequence data and a hidden state, maybe there’s some shuffling in the dataset that affects it (if you are constructing sentences from longer texts, for example). So maybe turn shuffling off in the data loader, if you haven’t done so already, and see whether that has an effect.
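
For reference, a minimal sketch of what turning shuffling off looks like; the dataset here is just a random stand-in built with the hyperparameters above, not your actual validation data:

import torch
from torch.utils.data import TensorDataset, DataLoader

# Stand-in validation set (random data); the real valid_dataset would go here.
valid_dataset = TensorDataset(
    torch.empty(100, 10, dtype=torch.long).random_(33988),
    torch.empty(100, dtype=torch.long).random_(363),
)

# shuffle=False keeps the example order fixed, so repeated evaluation runs
# iterate over exactly the same sequence order.
valid_loader = DataLoader(valid_dataset, batch_size=16, shuffle=False)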

Thank you!
It worked. I just changed shuffle to False.


@rasbt

Sorry, I don’t get it. Can you please expand?

The GRU hidden state will not get updated during eval mode, correct? Why would shuffling make a difference in that case?


Yeah, but it is still sequence data. E.g., a text with a random sentence order would not make as much sense as having the sentences in the correct order.


Hi,

I’m coming in a bit late here, but I’m running into a problem related to this. I am using GRU layers for a sequence classification problem where I take the last hidden state as input to dense layers.
I get different outputs for the same data and the same model weights if I either set the model to .eval() or shuffle the data. I was under the impression that the GRU hidden state is set to zero for all sequences in a batch (regardless of whether the model is in eval mode or not), so why would shuffling the data matter? In other words, is it not the case that every example in a batch is independent?
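
For reference, here is a minimal self-contained check of what I would expect (random weights, not my actual model): with no explicit h_0, every sequence in a batch starts from a zero hidden state and is processed independently.

import torch
import torch.nn as nn

torch.manual_seed(0)
gru = nn.GRU(input_size=4, hidden_size=8, batch_first=True).eval()

x = torch.randn(3, 5, 4)          # batch of 3 sequences, length 5
with torch.no_grad():
    out_batch, _ = gru(x)         # h_0 defaults to zeros for every sequence
    out_single, _ = gru(x[1:2])   # the second sequence on its own

# Prints True (up to floating point noise): batch items don't influence each other.
print(torch.allclose(out_batch[1:2], out_single, atol=1e-6))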

Thanks

Hi. I am facing the exact same issue of getting different accuracy results from model.eval(). Would you mind sharing where you turned off the shuffle argument? I have two data loaders: train_loader, which has batches of training data, and test_loader, which has batches of testing data.