import torch
import torch.nn as nn
import torch.nn.functional as F

class GRU(nn.Module):
    def __init__(self, vocab_size, emb_size, hidden_size, num_classes):
        super(GRU, self).__init__()
        self.encoder = nn.Embedding(vocab_size, emb_size)
        self.drop = nn.Dropout(p=0.8)
        self.gru = nn.GRU(emb_size, hidden_size, dropout=0.8, batch_first=True)
        self.bn = nn.BatchNorm1d(hidden_size)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        x = self.encoder(x)
        x = self.drop(x)
        x, _ = self.gru(x)
        # Self-attention over the GRU outputs, then sum over time steps
        scores = x.matmul(x.transpose(1, 2))
        scores = F.softmax(scores, dim=1)
        x = scores.matmul(x).sum(1)
        x = self.bn(x)
        x = self.fc(x)
        return x
I load the model like this:
model = GRU(vocab_size, embedded_size, hidden_size, num_classes)
model.load_state_dict(torch.load(PATH))
model = model.to(device)
model.eval()
Here is my accuracy test function:
def compute_accuracy(model, data_loader):
    correct_pred, num_examples = 0, 0
    for i, (features, targets) in enumerate(data_loader):
        features = features.to(device)
        targets = targets.to(device)
        logits = model.eval()(features)
        y_pred = logits.max(1)[1]
        num_examples += targets.size(0)
        correct_pred += (y_pred == targets).sum()
    return correct_pred.float() / num_examples * 100

print(compute_accuracy(model.eval(), valid_loader))
But every time I test the model on my dataset, the results are different.
Am I doing something wrong, or is .eval() mode not reliable?
This shouldn’t be the case and I can’t see anything obviously wrong.
One side note: could you call model.eval() once at the beginning of your compute_accuracy method and just pass the features as logits = model(features)?
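A minimal sketch of that suggestion, assuming the loader yields `(features, targets)` pairs as in the question. The `torch.no_grad()` context and the `device` parameter are additions for illustration and were not part of the original function:

```python
import torch

def compute_accuracy(model, data_loader, device="cpu"):
    """Return classification accuracy in percent over data_loader."""
    model.eval()  # switch dropout and batch norm to inference behaviour, once
    correct_pred, num_examples = 0, 0
    with torch.no_grad():  # assumption: gradients are not needed for evaluation
        for features, targets in data_loader:
            features = features.to(device)
            targets = targets.to(device)
            logits = model(features)
            y_pred = logits.argmax(dim=1)
            num_examples += targets.size(0)
            correct_pred += (y_pred == targets).sum().item()
    return correct_pred / num_examples * 100
```

Calling `model.eval()` inside the loop is harmless, so this change mainly tidies the function rather than fixing the varying results.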
Also, if it’s possible, could you share the state_dict as well as the values for the hyperparameters of your model?
I just tried random values and got the same output after setting model.eval().
(but I guess if they were not longs before you would get an error).
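One way to run this kind of check is to feed the same random token ids through the model twice and compare the outputs. The sketch below uses a small stand-in model (`TinyClassifier`, with made-up sizes) rather than the model from the question, so it only illustrates the procedure:

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    """Hypothetical stand-in: embedding -> dropout -> GRU -> batch norm -> linear."""
    def __init__(self, vocab_size=50, emb_size=8, hidden_size=16, num_classes=3):
        super().__init__()
        self.encoder = nn.Embedding(vocab_size, emb_size)
        self.drop = nn.Dropout(p=0.8)
        self.gru = nn.GRU(emb_size, hidden_size, batch_first=True)
        self.bn = nn.BatchNorm1d(hidden_size)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        x = self.drop(self.encoder(x))
        _, h = self.gru(x)              # h: (num_layers, batch, hidden)
        return self.fc(self.bn(h[-1]))

torch.manual_seed(0)
model = TinyClassifier()
tokens = torch.randint(0, 50, (4, 10))  # random token ids, dtype torch.long

model.eval()
with torch.no_grad():
    same_in_eval = torch.equal(model(tokens), model(tokens))

model.train()  # dropout is active again, so repeated passes are expected to differ
with torch.no_grad():
    same_in_train = torch.equal(model(tokens), model(tokens))

print(same_in_eval, same_in_train)
```

If the two eval-mode passes match here but the accuracy still changes between runs, the difference is more likely to come from the data pipeline than from the model.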
Another thing is, I am not exactly sure what your training loader does. Usually, shuffling wouldn’t affect the accuracy because you iterate through all examples anyway and don’t update the weights. However, since you have some hidden state, maybe there’s some shuffling in the dataset that may affect it (if you are constructing sentences from texts, for example). So maybe try to turn the shuffling off in the dataset loader if you haven’t done so and see if that has some effect.
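A short sketch of turning shuffling off, using `torch.utils.data.DataLoader` with a toy `TensorDataset` standing in for the real validation set (the tensors and batch size here are made up):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data standing in for the real validation set.
features = torch.arange(20).reshape(10, 2)
targets = torch.arange(10) % 2
dataset = TensorDataset(features, targets)

# shuffle=False (the default) yields batches in dataset order on every pass.
valid_loader = DataLoader(dataset, batch_size=4, shuffle=False)

first_pass = [f.clone() for f, _ in valid_loader]
second_pass = [f.clone() for f, _ in valid_loader]
identical_order = all(torch.equal(a, b) for a, b in zip(first_pass, second_pass))
```

With a fixed order, any remaining run-to-run difference cannot be attributed to how the batches were composed.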
Yeah, but it is still sequence data. E.g., a text with a random sentence order would not make as much sense as one with the sentences in the correct order.
I’m coming in a bit late here, but I’m running into a problem related to this. I am using GRU layers for a sequence classification problem where I take the last hidden state as input to dense layers.
I get different outputs for the same data and same model weights if I either set the model to .eval() or shuffle the data. I was under the impression that the GRU hidden state is set to zero for all sequences in a batch (regardless of whether the model is in eval mode), so why would shuffling the data matter? In other words, is it not the case that every example in a batch is independent?
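The independence assumption can be tested directly on a bare `nn.GRU`. This is a sketch with arbitrary sizes, not the poster's model: when no `h_0` is passed, the initial hidden state defaults to zeros, and each sequence's final hidden state should follow it through a shuffle of the batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
gru.eval()

batch = torch.randn(5, 12, 8)        # (batch, seq_len, features)
perm = torch.randperm(5)             # a random reordering of the batch

with torch.no_grad():
    _, h = gru(batch)                # no h_0 passed, so it defaults to zeros
    _, h_shuffled = gru(batch[perm])

# Row i of the shuffled result should equal row perm[i] of the original result.
rows_independent = torch.allclose(h[0][perm], h_shuffled[0], atol=1e-5)
print(rows_independent)
```

If this holds for the GRU alone, other parts of the pipeline are worth checking: a layer such as `nn.BatchNorm1d` in training mode does mix information across a batch, and active dropout makes repeated passes differ.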
Hi. I am facing the exact same issue of getting different accuracy results with model.eval(). Would you mind sharing where you turned off the shuffle argument? I have two dataloaders: train_loader, which has batches of training data, and test_loader, which has batches of testing data.