Synthetic Image Model Too Accurate?

I’m working on a project to identify grids whose cells contain one of the letters “X”, “G”, “O” or “ ” (blank). In the first stage of this research, I’m building a simple CNN to classify the letter that may appear in each cell.

I obtain a val_acc of 99.7% and a test accuracy of approximately 99.7% as well.

For the dataset, I’m using a synthetic dataset I’ve created that generates letters using different fonts (up to 7 No-Tofu fonts). The dataset has over 4,000 grids and, when preprocessed, yields 190,000 cells. All images are generated with blur, rotation, and other augmentations.
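To give an idea of the augmentation, the generator applies transforms roughly along these lines (a torchvision-style sketch; the exact parameters here are illustrative, not my real values):

```python
import torchvision.transforms as T

# Illustrative augmentation stack for the synthetic cells
# (single-channel images; parameters are representative only):
augment = T.Compose([
    T.RandomRotation(degrees=10),                     # slight random rotation
    T.GaussianBlur(kernel_size=3, sigma=(0.1, 1.5)),  # mild random blur
    T.ToTensor(),
    T.Normalize(mean=[0.5], std=[0.5]),               # normalisation
])
```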

My question here is: is the model overfitted, or is it just good at the task?

Although I preprocess my whole dataset before splitting it into train, val, and test (I know this is not recommended), the preprocessing only transforms the images into tensors and normalises them (no labels are used here), so theoretically no data leakage should be occurring.

Knowing this, is the model overfitted, leaking data, or something else?

Hi Javier!

If your test dataset is truly independent of your train dataset, then your results are solid.

Without “data leakage,” training on just your train dataset can’t overfit your test dataset.

I haven’t done the specific experiment, but I think classifying four different characters should be a pretty simple task, so I’m not surprised that you achieve 99+% accuracy with even a straightforward, relatively modest CNN.
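For instance, something along these lines would count as modest (the 32×32 single-channel input and the layer widths are just assumptions for illustration, not anything from your post):

```python
import torch.nn as nn

# A deliberately small CNN for 4-class, single-channel cell images.
class CellClassifier(nn.Module):
    def __init__(self, num_classes: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),   # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),   # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))
```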

As an aside, if your preprocessing includes normalizing over the entire unsplit dataset, then you do technically have some “data leakage,” although I would very much doubt that this would affect your results. In any event, normalizing your train, validation, and test datasets separately would be a suitable way to deal with such a possibility.
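Something along these lines would do it (the tensor names are placeholders):

```python
import torch

# Each split gets its own mean/std, so no split's statistics depend on any
# other split. (The common alternative -- computing the statistics on the
# train split alone and reusing them for val/test -- avoids the leakage
# just as well.)
def normalize_per_split(*splits, eps=1e-8):
    out = []
    for x in splits:  # each x: float tensor of shape (N, C, H, W)
        mean = x.mean(dim=(0, 2, 3), keepdim=True)
        std = x.std(dim=(0, 2, 3), keepdim=True)
        out.append((x - mean) / (std + eps))
    return out

# usage: train_x, val_x, test_x = normalize_per_split(train_x, val_x, test_x)
```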

Best.

K. Frank


^ I second this… You’ve described a pretty easy task. As long as you aren’t testing on the data that you used to train the model, your results should be correct.

Here are some ideas to test your code:

  • Use a 5th class in the test set that was never seen during training; the test accuracy should drop (that will certainly require some code adaptation).
  • Randomize the test labels without randomizing the test data (basically, you test against wrong labels); the test accuracy should fall to roughly chance level.
  • Do the same for training: randomize the training labels without randomizing the training data (basically, you train on random labels); the model should not be able to learn anything. A sketch of a label-randomizing wrapper for both of these checks follows this list.
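Here is a sketch of the label-randomizing wrapper mentioned above (the class name is made up; it assumes a dataset that yields (image, label) pairs):

```python
import torch
from torch.utils.data import Dataset

# Replaces the true labels with random ones while leaving the images
# untouched. Wrap the test set to check that accuracy collapses to roughly
# chance; wrap the training set to check that the model cannot learn.
class RandomLabelDataset(Dataset):
    def __init__(self, base: Dataset, num_classes: int = 4, seed: int = 0):
        self.base = base
        g = torch.Generator().manual_seed(seed)
        self.labels = torch.randint(num_classes, (len(base),), generator=g)

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        image, _ = self.base[idx]  # discard the real label
        return image, int(self.labels[idx])
```

With random test labels you should see roughly chance accuracy (about 25% for four classes); trained on random labels, the model should not beat chance on the real test set.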