Model.eval() gives different results every time

My model looks like this:

import torch
import torch.nn as nn
import torch.nn.functional as F


class GRU(nn.Module):

    def __init__(self, vocab_size, emb_size, hidden_size, num_classes):
        super(GRU, self).__init__()

        self.encoder = nn.Embedding(vocab_size, emb_size)
        self.drop = nn.Dropout(p=0.8)
        self.gru = nn.GRU(emb_size, hidden_size, dropout=0.8, batch_first=True)
        self.bn = nn.BatchNorm1d(hidden_size)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        x = self.encoder(x)                   # (batch, seq_len, emb_size)
        x = self.drop(x)
        x, _ = self.gru(x)                    # (batch, seq_len, hidden_size)
        scores = x.matmul(x.transpose(1, 2))  # self-attention scores
        scores = F.softmax(scores, dim=1)
        x = scores.matmul(x).sum(1)           # weighted sum over the sequence
        x = self.bn(x)
        x = self.fc(x)
        return x

I have loaded the model like this:

model = GRU(vocab_size, embedded_size, hidden_size, num_classes)
model.load_state_dict(torch.load(PATH))
model = model.to(device)
model.eval()

And here is my accuracy test function:

def compute_accuracy(model, data_loader):
    correct_pred, num_examples = 0, 0
    for i, (features, targets) in enumerate(data_loader):

        features = features.to(device)
        targets = targets.to(device)

        logits = model.eval()(features)
        y_pred = logits.max(1)[1]

        num_examples += targets.size(0)
        correct_pred += (y_pred == targets).sum()
    return correct_pred.float() / num_examples * 100

print(compute_accuracy(model.eval(), valid_loader))

But every time I test the model on my dataset, the results are different.
Am I doing something wrong, or is .eval() mode not reliable?

This shouldn’t be the case, and I can’t see anything obviously wrong.
One side note: could you call model.eval() once at the beginning of your compute_accuracy method and just pass the features in as logits = model(features)?

Also, if it’s possible, could you share the state_dict as well as the values for the hyperparameters of your model?
I just tried random values and got the same output after setting model.eval().

So my hyperparams are:

vocab_size = 33988
embedded_size = 500
hidden_size = 300
num_classes = 363

I modified my compute_accuracy, but the results are still different each time.

def compute_accuracy(model, data_loader, train=False, validation=False):
    correct_pred, num_examples = 0, 0
    model.eval()
    for i, (features, targets) in enumerate(data_loader):

        features = features.to(device)
        targets = targets.to(device)

        logits = model(features)
        y_pred = logits.max(1)[1]

        num_examples += targets.size(0)
        correct_pred += (y_pred == targets).sum()
    return correct_pred.float() / num_examples * 100

The state_dict of the model is too big to post here.

Thanks for the values!
I couldn’t reproduce this issue using some random inputs:

import torch
from torch.utils.data import TensorDataset, DataLoader

x = torch.empty(1000, 10, dtype=torch.long).random_(vocab_size)
y = torch.empty(1000, dtype=torch.long).random_(num_classes)
dataset = TensorDataset(x, y)
loader = DataLoader(
    dataset,
    batch_size=1,
    num_workers=1,
    shuffle=False
)

model.eval()

correct1 = 0.0
nb_samples = 0
with torch.no_grad():
    for data, target in loader:
        output = model(data)
        pred = output.argmax(1)
        correct1 += (pred==target).float().sum()
        nb_samples += data.size(0)
acc1 = correct1 / nb_samples

correct2 = 0.0
nb_samples = 0
with torch.no_grad():
    for data, target in loader:
        output = model(data)
        pred = output.argmax(1)
        correct2 += (pred==target).float().sum()
        nb_samples += data.size(0)
acc2 = correct2 / nb_samples

print(correct1, acc1)
print(correct2, acc2)

I noticed one possible issue: could you cast your comparison to float before summing it, like this:

correct_pred += (y_pred == targets).float().sum()

Depending on the batch size you are using, you might encounter an overflow here.
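
As a quick illustration (a standalone sketch, not code from this thread): in older PyTorch versions the comparison (y_pred == targets) returned a uint8 tensor, and uint8 arithmetic wraps around modulo 256, so an accumulated count can silently overflow.

import torch

# uint8 values wrap around at 256, so accumulating correct-prediction counts
# in uint8 can silently lose information.
a = torch.tensor(200, dtype=torch.uint8)
b = torch.tensor(100, dtype=torch.uint8)

print(a + b)                  # tensor(44, dtype=torch.uint8) -- wrapped around
print(a.float() + b.float())  # tensor(300.) -- casting to float avoids the wrap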

Here is the updated function with the float cast:

def compute_accuracy(model, data_loader, train=False, validation=False):
    correct_pred, num_examples = 0, 0
    model.eval()
    for i, (features, targets) in enumerate(data_loader):

        features = features.to(device)
        targets = targets.to(device)

        logits = model(features)
        y_pred = logits.max(1)[1]

        num_examples += targets.size(0)
        correct_pred += (y_pred == targets).float().sum()
    return correct_pred.float() / num_examples * 100

The results are:

First run:

tensor(81.4359, device='cuda:0')
tensor(75.7375, device='cuda:0')

Second run:

tensor(81.4284, device='cuda:0')
tensor(75.7506, device='cuda:0')

Are the two numbers you are printing correct_pred and (correct_pred / num_examples * 100)?

correct_pred.float() / num_examples * 100

Yes

Why is correct_pred not an integer (not the dtype but the value), if you just sum (y_pred == targets)?

return correct_pred / num_examples * 100

I tried it, but the results are still different:

tensor(81.4359, device='cuda:0')
tensor(75.7331, device='cuda:0')

and

tensor(81.4478, device='cuda:0')
tensor(75.7375, device='cuda:0')

For debugging purposes, maybe also try

(y_pred.long() == targets.long()).sum()

(but I guess if they were not longs before you would get an error).

Another thing: I am not exactly sure what your data loader does. Usually, shuffling wouldn’t affect the accuracy, because you iterate through all examples anyway and don’t update the weights. However, since you have sequence data and a hidden state, maybe there’s some shuffling in the dataset that affects it (if you are constructing sentences from longer texts, for example). So maybe turn shuffling off in the data loader, if you haven’t done so already, and see whether that has an effect.
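
For reference, a minimal sketch of what turning shuffling off looks like; the dataset here is just a random stand-in built with the hyperparameters above, not your actual validation data:

import torch
from torch.utils.data import TensorDataset, DataLoader

# Stand-in validation set (random data); the real valid_dataset would go here.
valid_dataset = TensorDataset(
    torch.empty(100, 10, dtype=torch.long).random_(33988),
    torch.empty(100, dtype=torch.long).random_(363),
)

# shuffle=False keeps the example order fixed, so repeated evaluation runs
# iterate over exactly the same sequence order.
valid_loader = DataLoader(valid_dataset, batch_size=16, shuffle=False)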

Thank you!
It worked. I just changed shuffle to False.


@rasbt

Sorry, I don’t get it. Can you please expand?

The GRU hidden state will not get updated during eval mode, correct? Why would shuffling make a difference in that case?


Yeah, but it is still sequence data. E.g., a text with a random sentence order would not make as much sense as having the sentences in the correct order.


Hi,

I’m coming in a bit late here, but I’m running into a problem related to this. I am using GRU layers for a sequence classification problem where I take the last hidden state as input to dense layers.
I get different outputs for the same data and the same model weights if I either set the model to .eval() or shuffle the data. I was under the impression that the GRU hidden state is set to zero for all sequences in a batch (regardless of whether the model is in eval mode or not), so why would shuffling the data matter? In other words, is it not the case that every example in a batch is independent?
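
For reference, here is a minimal self-contained check of what I would expect (random weights, not my actual model): with no explicit h_0, every sequence in a batch starts from a zero hidden state and is processed independently.

import torch
import torch.nn as nn

torch.manual_seed(0)
gru = nn.GRU(input_size=4, hidden_size=8, batch_first=True).eval()

x = torch.randn(3, 5, 4)          # batch of 3 sequences, length 5
with torch.no_grad():
    out_batch, _ = gru(x)         # h_0 defaults to zeros for every sequence
    out_single, _ = gru(x[1:2])   # the second sequence on its own

# Prints True (up to floating point noise): batch items don't influence each other.
print(torch.allclose(out_batch[1:2], out_single, atol=1e-6))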

Thanks

Hi. I am facing the exact same issue of getting different accuracy results from model.eval(). Would you mind sharing where you turned off the shuffle argument? I have two data loaders: train_loader, which has batches of training data, and test_loader, which has batches of testing data.