Best practices for model performance assessment

Let’s say we have two deep learning methods/models (both trained in a stochastic fashion) and one dataset, and we want to assess the performance of both methods to decide which one performs best.

Since the optimization of the models is stochastic, relying on the results of a single experiment (training each model just once) is not reliable. Instead, each model should be trained n times, and the results can then be assessed statistically.
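
Just to make the setup concrete, here is roughly the protocol I have in mind, written as a sketch: `train_and_evaluate` is a hypothetical placeholder for the real training/evaluation pipeline, and the returned numbers are not real results.

```python
import random

# Hypothetical stand-in for the real training + evaluation pipeline; in practice
# this would train the given model from scratch with the given seed and return
# its test accuracy. The value returned here is a placeholder, not a real result.
def train_and_evaluate(model_name: str, seed: int) -> float:
    random.seed(seed)
    return 75.0 + random.random()

n = 5  # number of independent runs per model

results = {"model_a": [], "model_b": []}
for model_name in results:
    for seed in range(n):
        # Same data and hyperparameters for every run; only the seed changes.
        results[model_name].append(train_and_evaluate(model_name, seed=seed))

print(results)
```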

With that in mind:

  1. What are the best practices I should follow when performing this kind of comparison? Does the community follow any standard? For example, how many times do people train the model, i.e. what is a proper value for n?
  2. Given multiple results of the same model on a given dataset, which metrics are commonly reported? Mean, standard deviation, margin of error / 95% confidence interval (or a significance level of 5%)?
  3. I found in some papers that the results presented in the tables are followed by the ± sign, but it isn’t described in the text. Does it represent the 95% confidence interval, i.e. mean ± margin of error (see the sketch after this list for what I mean)?
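
To make questions 2 and 3 concrete, here is a minimal sketch of what I mean by reporting mean ± margin of error over n runs. The accuracy values are made up for illustration, and the 95% interval is computed with a t-distribution; I’m not claiming this is how the papers actually compute it.

```python
import numpy as np
from scipy import stats

# Hypothetical top-1 accuracies from n = 5 runs of the same model (made-up numbers)
runs = np.array([76.3, 76.8, 76.1, 76.5, 76.9])

n = len(runs)
mean = runs.mean()
std = runs.std(ddof=1)                  # sample standard deviation
sem = std / np.sqrt(n)                  # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)   # two-sided 95% critical value
margin = t_crit * sem                   # margin of error

print(f"mean = {mean:.2f}, std = {std:.2f}")
print(f"95% CI: {mean:.2f} ± {margin:.2f}")
```

Is this (or something like it, e.g. mean ± standard deviation) what the ± in those tables usually denotes?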

I’m not sure whether this type of question is appropriate for this forum specifically, but it’s the only forum I read/use outside of papers. If you are aware of another community where this question would be a better fit, I would appreciate the recommendation.

CC @rwightman, who has been training lots of state-of-the-art CNNs in his timm repository and might want to share his standard approach to evaluating model accuracy.