I’m working on a project to recognise grids whose cells contain one of the characters “X”, “G”, “O”, or “ ” (blank). As the first stage of this work, I’m training a simple CNN to classify the character in each cell.
I get a validation accuracy of 99.7% and a test accuracy of approximately 99.7% as well.
For the data, I’m using a synthetic dataset I created that renders the letters in different fonts (up to 7 No-Tofu fonts). It contains over 4,000 grids which, after preprocessing, yield about 190,000 cells. All images are generated with augmentations such as blur, rotation, and so on.
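To give an idea of what the generator does, here is a simplified sketch of how a single cell could be rendered (PIL-based; the font files and parameter ranges are placeholders, not my exact settings):

```python
import random
from PIL import Image, ImageDraw, ImageFont, ImageFilter

FONTS = ["DejaVuSans.ttf", "Arial.ttf"]  # placeholder font files, not the real No-Tofu set
CLASSES = ["X", "G", "O", " "]

def make_cell(label: str, size: int = 32) -> Image.Image:
    """Render one grid cell with a random font, a small rotation and some blur."""
    img = Image.new("L", (size, size), color=255)          # white background, grayscale
    if label != " ":                                        # blank cells stay empty
        font = ImageFont.truetype(random.choice(FONTS), size=int(size * 0.7))
        ImageDraw.Draw(img).text((size * 0.2, size * 0.1), label, fill=0, font=font)
    img = img.rotate(random.uniform(-10, 10), fillcolor=255)            # slight rotation
    img = img.filter(ImageFilter.GaussianBlur(random.uniform(0.0, 1.0)))  # slight blur
    return img

# cells = [(make_cell(c), c) for c in random.choices(CLASSES, k=1000)]
```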
My question is: is the model overfitting, or is it simply good at the task?
Although I preprocess the whole dataset before splitting it into train, val, and test (I know this isn’t recommended), the preprocessing only converts images to tensors and normalises them (no labels are involved), so in theory no data leakage should be occurring.
Given all this, is the model overfitting, suffering from data leakage, or something else?
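For reference, a minimal sketch of the split-then-normalise variant (assuming PyTorch; `cells` and `labels` are placeholder tensors, not my actual code), where the normalisation statistics come from the training split only:

```python
import torch
from torch.utils.data import TensorDataset, random_split

# cells: float tensor of shape (N, 1, H, W); labels: long tensor of shape (N,)
dataset = TensorDataset(cells, labels)

n = len(dataset)
n_train, n_val = int(0.8 * n), int(0.1 * n)
train_set, val_set, test_set = random_split(
    dataset, [n_train, n_val, n - n_train - n_val],
    generator=torch.Generator().manual_seed(0),
)

# Normalisation statistics are computed on the training split only,
# so the val/test images never influence the transform.
train_imgs = torch.stack([img for img, _ in train_set])
mean, std = train_imgs.mean(), train_imgs.std()

def normalise(img: torch.Tensor) -> torch.Tensor:
    return (img - mean) / std
```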
^ I second this… You’ve described a pretty easy task. As long as you aren’t testing on the data you used to train the model, your results should be valid.
Add a 5th class to the test set that was never seen during training; the test accuracy should be lower. (That will certainly require some code adaptation.)
Randomize the test labels without randomizing the test data (basically, you test with wrong labels); the test accuracy should be very low.
Do the same for training: randomize the train labels without randomizing the train data (basically, you train on random labels); the model should not be able to learn anything. A rough sketch of these label-shuffling checks is below.
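Something along these lines, assuming a Keras-style model compiled with an accuracy metric (`model`, `build_model`, and the data arrays are placeholders for your own objects):

```python
import numpy as np

rng = np.random.default_rng(0)

# Check 1: shuffle the *test* labels only. With a leak-free pipeline,
# accuracy should collapse to roughly chance level.
y_test_shuffled = rng.permutation(y_test)
_, acc_shuffled_test = model.evaluate(x_test, y_test_shuffled, verbose=0)
print(f"accuracy on shuffled test labels: {acc_shuffled_test:.3f}")

# Check 2: shuffle the *training* labels and retrain from scratch. A model
# trained on random labels should not generalise: test accuracy ~ chance.
y_train_shuffled = rng.permutation(y_train)
fresh_model = build_model()  # hypothetical factory that returns a new, untrained CNN
fresh_model.fit(x_train, y_train_shuffled, epochs=5, verbose=0)
_, acc_after_random_train = fresh_model.evaluate(x_test, y_test, verbose=0)
print(f"test accuracy after training on shuffled labels: {acc_after_random_train:.3f}")
```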