Performance is good on the validation set, poor on the test set

Hi everyone,

Nowadays there are a variety of automatic hyperparameter search tools.

I used these tools to find model hyperparameters, and while performance was very good on the validation set, it degraded severely on the test set.

In addition, I have set the random seed and torch.backends.cudnn.deterministic = True, but the results are still hard to reproduce even with the "optimal" hyperparameters.
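For reference, here is the kind of seeding setup I mean, a minimal sketch that seeds Python, NumPy, and PyTorch RNGs and disables cuDNN auto-tuning (the function name `set_seed` is just illustrative):

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Seed all RNGs that typically affect a PyTorch training run."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op if CUDA is unavailable
    # cuDNN: pick deterministic kernels and stop benchmarking
    # (benchmarking can select different algorithms run-to-run)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Optionally, newer PyTorch can also enforce determinism globally,
    # at the cost of errors for ops with no deterministic implementation:
    # torch.use_deterministic_algorithms(True)

set_seed(42)
```

Note that even with all of this, nondeterminism can remain from DataLoader worker processes, non-deterministic CUDA ops, and different hardware/library versions, which may explain part of the replication gap.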

I also used some common tricks (early stopping, LR scheduling).

What is everyone’s opinion on this phenomenon?