Let’s say we have two models, or the same model with variations (optimizer, layer depth, etc.). How can we compare these models in a reliable way?

Additionally, when we use different loss criteria, how can we choose one of them and be confident that it is better than the others?

Reducing training non-determinism and doing n-fold cross-validation is pretty reliable. But all of that is obviously slow…
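As a rough illustration of the n-fold idea, here is a minimal, framework-free sketch of index-based fold splitting; the helper name `kfold_indices` is hypothetical, not an API from any library:

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_indices, val_indices) pairs for n-fold cross-validation.

    A plain-Python sketch: each fold gets a contiguous slice of indices
    as the validation set, and everything else as the training set.
    """
    # distribute any remainder across the first folds
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size

# example: 10 samples, 5 folds -> 5 disjoint validation sets of size 2
folds = list(kfold_indices(10, 5))
```

In practice you would shuffle indices first (with a fixed seed, for reproducibility) and average the per-fold metric across folds before comparing models.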

If you switch the loss function, you postulate a different probability distribution family for the predicted values. If that’s your intent, you can compare model log-likelihoods (for non-deep, compact models that would be AIC). A simpler approach is just to compare some auxiliary metrics.
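To make the log-likelihood/AIC comparison concrete, here is a small sketch under a simplifying assumption: predictions are treated as the mean of a Gaussian with a fixed, known noise scale `sigma`. Both function names are hypothetical helpers for illustration:

```python
import math

def gaussian_log_likelihood(y_true, y_pred, sigma):
    """Log-likelihood of the targets under N(y_pred, sigma^2).

    Assumption: i.i.d. Gaussian residuals with known sigma, which is
    the distribution family implied by a mean-squared-error loss.
    """
    n = len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - sse / (2 * sigma ** 2)

def aic(log_lik, n_params):
    # AIC = 2k - 2 ln L; lower is better
    return 2 * n_params - 2 * log_lik

# toy comparison: model A fits the data better than model B
y_true = [1.0, 2.0, 3.0]
ll_a = gaussian_log_likelihood(y_true, [1.0, 2.0, 3.0], sigma=1.0)
ll_b = gaussian_log_likelihood(y_true, [0.0, 0.0, 0.0], sigma=1.0)
```

With equal parameter counts the AIC comparison reduces to comparing held-out log-likelihoods, which is the part that stays meaningful for deep models.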

As I understand it, the AIC metric tells us which DL model is relatively better at prediction, so I can use it for model selection. Thank you.

I think torch.set_deterministic() and torch.manual_seed(0) are pretty useful for comparing how changes affect the models. Is this a good way to build a DL model testing pipeline?

Use of AIC’s number-of-parameters term is questionable with overparametrized DL models. But the likelihood part is comparable across models, yes.

Manual seeding is good, as it has no computational side effects. Additional determinism measures should be used with care: you can slow things down without achieving bit-wise reproducibility. IMO, seeding is most appropriate for continuous hyperparameter optimization, to avoid confusing some search algorithms; the effect of discrete model/parameter choices should be more obvious.
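A minimal sketch of what manual seeding buys you, assuming PyTorch is installed: seeding just before construction makes weight initialization reproducible, so two runs start from bit-wise identical parameters (the `build_model` helper is hypothetical):

```python
import torch

def build_model(seed):
    # seeding immediately before construction fixes the weight init
    torch.manual_seed(seed)
    return torch.nn.Linear(4, 2)

a = build_model(0)
b = build_model(0)
# same seed -> bit-wise identical initial parameters
assert torch.equal(a.weight, b.weight) and torch.equal(a.bias, b.bias)
```

Note this only fixes initialization and CPU-side RNG; full run-to-run determinism (e.g. on CUDA) requires further measures such as deterministic algorithm settings, which can carry the performance cost mentioned above.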