How to judge and reduce overfitting on our datasets?

There are 113 classes in my dataset, and the images contain few usable features, so it is easy to overfit. We use ResNet50 as the backbone network. The best model, obtained after 300 epochs, has an accuracy of about 63% on the test set and 99% on the training set.
So here are my questions:
1. How to judge whether it is overfitting? If it is judged by the nearly 100% accuracy on the training set, it is obviously overfitting. If the criterion is instead ‘the accuracy difference between the training set and the test set should stay under 20%’, that threshold is already exceeded by the 2nd-3rd epoch. But at the same time, as training epochs increase, the accuracy on the test set keeps improving and eventually reaches about 63%. (Our training set and test set are completely isolated; no image appears in both.)
2. As mentioned in 1, can the accuracy on the test set in this case represent the generalization ability in the open world? Our dataset is fossil data; the fossils are carbonized and flattened, there is little detail information available in the images, and the sample size is not large. **Is there any overfitting?** In this case, can the criteria for overfitting be changed? Or are there other methods to judge whether it’s overfitting or not?
3. How to reduce overfitting? At present, dropout (p=0.5) and weight_decay have been tried; the weight decay is applied to update the weights only, with a value of 1e-3. The effect of dropout is slightly better, but the accuracy difference between the training set and the test set still exceeds 20% before the fifth epoch. Next, we plan to try ResNet18 as the backbone network.
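For reference, the dropout + weight-decay setup we tried looks roughly like this in PyTorch (a minimal sketch; only the p=0.5 dropout and the 1e-3 weight decay come from the description above, while the optimizer, learning rate, and where the dropout layer sits are placeholders):

import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(pretrained=True)
# dropout p=0.5 before the 113-class head (where exactly the dropout layer
# is inserted is an assumption; only p=0.5 comes from the description above)
model.fc = nn.Sequential(
    nn.Dropout(p=0.5),
    nn.Linear(model.fc.in_features, 113),
)

# weight decay (1e-3) on weight tensors only; biases and BatchNorm
# parameters are left undecayed
decay, no_decay = [], []
for name, param in model.named_parameters():
    if param.ndim == 1 or name.endswith('.bias'):
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = torch.optim.SGD(
    [
        {'params': decay, 'weight_decay': 1e-3},
        {'params': no_decay, 'weight_decay': 0.0},
    ],
    lr=0.01,       # learning rate is a placeholder
    momentum=0.9,
)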

I would check the similarity between the training and test sets using adversarial validation. I don’t know how well this works when the inputs are images. See the post here: https://towardsdatascience.com/how-to-assess-similarity-between-two-datasets-adversarial-validation-246710eba387
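If the inputs are images, one rough way to run adversarial validation is to label training images 0 and test images 1, fit a simple classifier on some representation of the images (downscaled pixels or CNN embeddings), and check whether its cross-validated AUC stays near 0.5. A minimal sketch with scikit-learn, assuming you already have feature arrays `train_feats` and `test_feats` (placeholder names, e.g. ResNet embeddings):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# train_feats, test_feats: (N, D) arrays of image features (placeholders)
X = np.vstack([train_feats, test_feats])
y = np.concatenate([np.zeros(len(train_feats)), np.ones(len(test_feats))])

clf = LogisticRegression(max_iter=1000)
auc = cross_val_score(clf, X, y, cv=5, scoring='roc_auc').mean()
print(f'adversarial validation AUC: {auc:.3f}')  # ~0.5 => train/test look similar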
Secondly, overfitting is not only about dropout and regularization; there is a model complexity issue too!
Lastly, would it be helpful to set an early stopping criterion so that the difference between train and validation accuracy does not exceed 10% (let’s say patience=20 epochs)? See Cooper statistics if that helps: On the development and validation of QSAR models - PubMed

Thank you for your suggestion!
We are going to try a lower-complexity model. Our model starts overfitting after only 1-3 epochs. Would early stopping be meaningful in this case?

Firstly, what are the sizes of the training and test datasets?
If the number of images is small, you can consider creating multiple patches from each image (depending on whether this is applicable to your problem).
You should initially proceed with a less complex model, something like ResNet18 (see the sketch below).
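A minimal sketch of the backbone swap with torchvision (the 113-class head matches the dataset described above; everything else is standard):

import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)           # ~11M parameters vs ~25M for ResNet50
model.fc = nn.Linear(model.fc.in_features, 113)    # head for the 113 fossil classes

n_params = sum(p.numel() for p in model.parameters())
print(f'ResNet18 parameters: {n_params / 1e6:.1f}M')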

Sometimes the dataset is too complex and no model will work well on CNN features alone; you may have to look for alternative features in addition to the features extracted directly by the CNN (one possible way is sketched below).
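One possible way is to concatenate handcrafted descriptors with the CNN embedding before the classifier. The sketch below is only an illustration; `handcrafted_dim` and how the descriptors are computed are hypothetical and problem-specific:

import torch
import torch.nn as nn
from torchvision import models

class FossilClassifier(nn.Module):
    """CNN embedding concatenated with handcrafted descriptors (illustrative)."""
    def __init__(self, num_classes=113, handcrafted_dim=32):
        super().__init__()
        backbone = models.resnet18(pretrained=True)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop fc
        self.classifier = nn.Linear(512 + handcrafted_dim, num_classes)

    def forward(self, images, handcrafted):
        x = self.features(images).flatten(1)          # (B, 512) CNN embedding
        x = torch.cat([x, handcrafted], dim=1)        # append handcrafted features
        return self.classifier(x)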

Data is sufficient.
You may vary the learning rate and observe the performance. You can also try some pre-processing on the images (as the problem requires) before sending them to the CNN model (see the augmentation sketch after this reply).
Further, try dropout and other generalisation methods, and try more recent classification models.
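For the pre-processing/augmentation point, a small torchvision pipeline sketch (the specific transforms and magnitudes are only examples; whether they suit carbonized, flattened fossils is a domain question):

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])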

@gns24 You save a checkpoint and let the model train for up to 20 more epochs (as I mentioned previously, patience=20), then see if the difference between training and validation accuracy decreases.
See below:

import torch

# initial accuracy-gap threshold (10%) and patience for early stopping
min_accu_diff = 0.1
patience = 20
epochs_no_improve = 0

for epoch in range(EPOCHS):

    # train model, validate model HERE
    # derive the accuracy difference between train & validation as "current_accu_diff"

    # manual stopping criterion: save a checkpoint whenever the gap shrinks
    if current_accu_diff < min_accu_diff:
        torch.save(model.state_dict(), 'checkpoint.pt')
        epochs_no_improve = 0
        min_accu_diff = current_accu_diff
    else:
        epochs_no_improve += 1
        print(f'EarlyStopping counter: {epochs_no_improve} out of {patience}')

    # early stopping once the gap has not improved for `patience` epochs
    if epochs_no_improve >= patience:
        print(f'Early stopping at epoch {epoch}\n')
        break
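Once training stops, the checkpoint with the smallest train/validation gap can be reloaded for evaluation (a two-line sketch, assuming the same `model` object as above):

# restore the checkpoint with the smallest train/validation accuracy gap
model.load_state_dict(torch.load('checkpoint.pt'))
model.eval()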