Common data splitting pattern in the pytorch examples

I was wondering if anyone else has noticed this, but most examples in the pytorch repo only split the data two ways (train/test) and then use the test set to evaluate model performance after every epoch. Isn't that just plain wrong? Even splitting three ways (train/valid/test), with datasets the size of MNIST or CIFAR I'm still not sure that's enough. Any thoughts on these issues, or am I dead wrong about this?

I think it's just a standard convention that has been adopted in machine learning. The idea is to use the training dataset to fine-tune your model, the validation dataset to assess and reassess the trained model's performance for over-fitting, and the test dataset to test the completed model.
The split should be as uniform as possible so it doesn't introduce any bias into the data.
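For what it's worth, here's a minimal sketch of what that looks like in practice with MNIST, carving a validation set out of the official training set with `torch.utils.data.random_split` so the official test set stays untouched. The 55k/5k numbers, the seed, and the batch sizes are just placeholders, not anything taken from the examples repo:

```python
# Minimal sketch: hold out a validation split from the MNIST training set,
# keeping the official test set untouched until the very end.
import torch
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

transform = transforms.ToTensor()
full_train = datasets.MNIST("data", train=True, download=True, transform=transform)
test_set = datasets.MNIST("data", train=False, download=True, transform=transform)

# 55k/5k split of the 60k training images; seed the generator for reproducibility.
train_set, valid_set = random_split(
    full_train, [55000, 5000], generator=torch.Generator().manual_seed(0)
)

train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
valid_loader = DataLoader(valid_set, batch_size=256)
test_loader = DataLoader(test_set, batch_size=256)  # only used once, at the end
```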

Sure thing, I agree that it's definitely a convention in ML to have a train/valid/test split. That's fine; what I'm arguing is that the examples in the pytorch repo omit the validation set completely and only have a train and a test set. Not only that, but during training they expose the test set to the model to check its generalization capability. From my perspective this seems wrong, since the test set should stay untouched and be exposed to your model only once, at the end, to get the overall performance.

The other thing I'm arguing about is the conventional train/valid/test split, which is fine if there is sufficient data, and here's where the line gets fuzzy. What do we mean by sufficient data, especially for deep learning models? Are 50k samples sufficient for a three-way split?

Oh, I believe the provided test set can serve as either a validation or a test dataset. I think they skip the validation dataset just to keep the demo simple. If you could provide a link or two as a reference for when they do this, I or someone else could give a much more direct response.

As for the second question, that's largely up for debate. I've heard a typical split is 80/10/10 for train/validation/test, but I'm sure you can find a good resource on the amount of data required online.
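As a rough sketch of what an 80/10/10 split could look like for a generic dataset (the `three_way_split` helper and the exact ratios are just illustrative; the right fractions depend on how much data you have):

```python
# Rough sketch of an 80/10/10 train/valid/test split for any map-style dataset.
import torch
from torch.utils.data import random_split

def three_way_split(dataset, fractions=(0.8, 0.1, 0.1), seed=0):
    n = len(dataset)
    n_train = int(fractions[0] * n)
    n_valid = int(fractions[1] * n)
    n_test = n - n_train - n_valid  # remainder goes to the test split
    return random_split(
        dataset, [n_train, n_valid, n_test],
        generator=torch.Generator().manual_seed(seed),
    )
```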

Thanks for the reply.

That's completely understandable, but think of someone who has just landed on the pytorch examples repo and doesn't know much about ML. Presumably they'll just run this example, which is the simplest one, and think they got X% accuracy, but that's not really the case since the test set is used for model evaluation during training.
The bottom line is that we should enforce good practices, at least doing a train/valid/test split, so we don't mislead anyone by accident.
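To make the suggestion concrete, here's a hypothetical skeleton (the `train_one_epoch`, `num_epochs`, and loader names are assumptions, not code from the repo) where the validation set is what gets evaluated every epoch and the test set is touched exactly once at the end:

```python
# Hypothetical training skeleton: validation guides training, test is used once.
import torch

def evaluate(model, loader, device="cpu"):
    """Return accuracy of `model` on `loader` without updating any weights."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            preds = model(x).argmax(dim=1)
            correct += (preds == y).sum().item()
            total += y.size(0)
    return correct / total

# for epoch in range(num_epochs):
#     train_one_epoch(model, train_loader, optimizer)  # usual training step
#     val_acc = evaluate(model, valid_loader)          # monitor over-fitting here
#     # (optionally keep the checkpoint with the best val_acc)
#
# test_acc = evaluate(model, test_loader)              # reported once, at the end
```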

I also see the point you mentioned and think the current approach was chosen because it lets you focus on the fine-tuning mechanics. However, as you explained, while it might be clear to experienced users that the "real" test dataset is missing, it might confuse newcomers.
Feel free to create an issue to discuss this further.

Thanks, just created it! To provide another example on this topic: I was looking for trained models on CIFAR-10 and I landed here. First of all, let me say that I'm grateful to people like the one I'm referencing for releasing models and weights. Second, it's not the only repo I've seen reporting test accuracy numbers for models trained on CIFAR-10, but if you look closer at the link above, in the Jupyter notebook you'll see that those test accuracy numbers are not based on an untouched test set; the test set is actually used as the validation set, similar to the MNIST example in the pytorch examples repo.