Different seeds lead to very different test results on the test set

I trained my model several times with different seed settings, using 70% of the data for training and 30% for testing. After each training run, I evaluated on the test set.

But different seeds lead to very different test results, like the following:

seed: 41, 93, 142, 194, 245
test: 95%, 90%, 96.67%, 95%, 93.33%

The biggest difference is 6.67%.

Can anyone help me and explain why? I'm very confused, because I think different seeds should give similar results when using the same training and test data.

I used this function for seed setting (I'm using the CPU only):

import random

import numpy as np
import torch


def setup_seed(seed):
    # seed every RNG in use: Python's stdlib, NumPy, and PyTorch
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op on a CPU-only machine
    # force deterministic cuDNN kernels (only relevant on a GPU)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
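
I call it once at the start, before building anything that draws random numbers, e.g.:

    setup_seed(41)
    # ... create datasets/dataloaders, build the model, train, evaluate ...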

Thanks in advance.

A high variance in the final model performance points to an unstable training routine, which might come from different data splits, the model architecture, the model initialization, etc.

You could try to stabilize the training for different seeds by e.g. changing the parameter initialization or making sure the data split is stratified.
You should not pick the best result based on a specific seed. If you can’t stabilize the training, you could calculate the mean and stddev of the accuracy and report this instead.
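
E.g. here is a quick sketch of a stratified split via sklearn (the arrays are random placeholders standing in for your dataset):

import numpy as np
from sklearn.model_selection import train_test_split

# placeholder data standing in for the real dataset
X = np.random.randn(200, 10)
y = np.random.randint(0, 2, size=200)

# stratify=y keeps the class proportions identical in the 70/30 splits,
# so the split itself won't add seed-dependent variance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)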


Thank you very much! I will try your advice later.

Why shouldn't we pick the best result (accuracy)? In my opinion, all of the results lie within the high-dimensional loss landscape of the model, so the sub-optimal results (for the other seeds) are just cases where the optimization got stuck in higher local minima, while the optimal result fell into a lower minimum. Therefore, can we say that the capacity and generalization of the model empirically achieve the highest accuracy (96.67%), and thus report the best result instead of the mean and stddev?

Given that only a training and validation split is used (where the validation split is named the test set) without a proper test set, I claim it's misleading to select the best validation performance and present this as your model performance. You could of course still stick to this approach and evaluate on a final test set (using new and unseen samples) without the opportunity to pick another model, even if the test set performs worse. If you were now to pick the best test set performance, you would again be leaking data into the training process and claiming a misleading performance for your training.
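
As a sketch of this workflow (placeholder data again; the split ratios are just examples): hold out the test set first, select the seed/configuration on the validation set only, and evaluate the chosen model once on the test set.

import numpy as np
from sklearn.model_selection import train_test_split

# placeholder data standing in for the real dataset
X = np.random.randn(300, 10)
y = np.random.randint(0, 3, size=300)

# hold out a test set that is never used for any model selection
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
# split the remainder into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=0
)
# pick the best seed/configuration based on (X_val, y_val) only,
# then report the single chosen model's accuracy on (X_test, y_test)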


Actually, the answer is mixed. At the beginning of the deep learning era, CNN papers often reported only a single run because of computational constraints: training a large ImageNet model could take weeks or even months, so repeating runs across seeds wasn't feasible. This practice continues even today for large language models.

For small datasets and shallow models, different seeds can lead to noticeably different test accuracies. This happens because each seed affects the random initialization and training trajectory, which may land in different local minima of the loss landscape. Some minima generalize better than others, so the test results vary.

For large datasets and deep models, the loss landscape has many good minima. In such cases, hyperparameters like learning rate and batch size usually matter more than the random seed.

Finally, since the question doesn't mention the model architecture (though the 70:30 split suggests the dataset might be small), it is difficult to say whether the 6.67% variation is expected or indicates instability.


If I understand your explanation correctly, a seed that gives higher accuracy on the validation set doesn't guarantee that it will give higher accuracy on the test set than another seed with lower validation accuracy?

If I have one configuration trained under five different seeds with validation accuracies of 90%, 90.5%, 92%, 92.3%, and 93%, while another configuration under the same five seeds gives validation accuracies of 88%, 93%, 93.5%, 93.7%, and 96%, then in this case the seeds and the hyperparameters contribute about equally to the validation accuracy. Can we say that the training is unstable?

Yes, the configuration that produces high variance in the validation (or test) accuracy is unstable. Ideally, we want low variance. For example, with earlier deep models such as deep belief networks, researchers experimented with as many as 400 seeds [paper]!
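
To put numbers on it, here is a small sketch using the accuracies from your example (mean and sample stddev across the five seeds):

import numpy as np

# validation accuracies of the two configurations across the five seeds
config_a = np.array([90.0, 90.5, 92.0, 92.3, 93.0])
config_b = np.array([88.0, 93.0, 93.5, 93.7, 96.0])

for name, acc in (("A", config_a), ("B", config_b)):
    # ddof=1 gives the sample standard deviation across seeds
    print(f"config {name}: mean={acc.mean():.2f}%, std={acc.std(ddof=1):.2f}%")

Configuration B reaches the higher mean accuracy but with a noticeably larger spread across seeds, so it is the less stable of the two. This is also exactly the kind of mean-and-stddev summary you would report instead of the single best seed.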


That answers my question. Thank you!