Classifier bias

Greetings, everyone!
I would like to clarify a case with an image classifier, which I will first describe in general terms and then illustrate with a Colab example.
I have images of 7 classes that are easily predicted by a simple ResNet classifier with 98+% accuracy. I modified my dataset a bit: I took the samples from one class and randomly split them in half; the first half retained the previous class, and the other half was assigned a new class. My goal was to confuse the classifier between the two classes in the pair, since the sample distribution in both classes was the same. I expected the network to classify the samples from the similar classes with equal probability: for example, if a single class in the previous setup was guessed with 98% accuracy, then in the new dataset the correct guesses would be 49% vs 49%, or at least without a landslide difference.
But the classifier preferred to assign samples from both subsets to one of the classes, more like 70% vs 30%.
I understand that I'm dealing with bias, but I wonder whether this situation is normal.
The reason I was doing this is my experiments with metadata, which I supply to a semantic segmentation network via a separate input; this metadata is a hint about the sample's domain.
I'm convinced that metadata becomes important when the network cannot guess from the source sample which domain it belongs to. I quantify my assumption about how easily the network can guess the domain with the ResNet classifier.
And I’m a bit uncomfortable that the classifier prefers one class when it cannot distinguish between two (or more…) classes.
I might be wrong, and I can make peace with this bias if somebody from this respectable community can confirm it or point me to a published discussion of the topic.

I prepared a test case for illustration:

Here I overrode the CIFAR10 dataset's __getitem__ method: when it returns an image of class 8 ('ship'), it can randomly reassign it to a new class 11.
Similarly, a sample of class 9 ('truck') can either keep its label or be assigned a new class 10 ('truk_m', 'modified truck').
The simple classifier from the PyTorch tutorial, after training, doesn't distribute its predictions equally between the similar classes but usually prefers to predict one class more often (see the confusion matrix at the bottom). A sketch of the dataset override is below.
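
Roughly, the override looks like this (a simplified sketch; the wrapper name SplitCIFAR10 and the 0.5 split probability are just illustrative, the Colab version differs in details):

```python
import random
import torchvision
from torchvision import transforms

# Sketch of the __getitem__ override: ships (class 8) can randomly become
# a new class 11, trucks (class 9) can randomly become a new class 10 ('truk_m').
class SplitCIFAR10(torchvision.datasets.CIFAR10):
    def __getitem__(self, index):
        img, target = super().__getitem__(index)
        if target == 8 and random.random() < 0.5:    # ship -> new class 11
            target = 11
        elif target == 9 and random.random() < 0.5:  # truck -> new class 10
            target = 10
        return img, target

# The classifier head then needs 12 outputs instead of 10.
train_set = SplitCIFAR10(root="./data", train=True, download=True,
                         transform=transforms.ToTensor())
```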

I think it's better to train the model for more epochs; you can also use a larger learning rate.
Your loss is still too high, so training for a bit longer may give you a better result.
You can use weight decay to get more generalization.
I found these topics, I hope they help you.

Expected calibration error (ECE) is a metric that compares a neural network model's output pseudo-probabilities to the model's accuracy. ECE values can be used to calibrate (adjust) a neural network model so that its output pseudo-probabilities more closely match the actual probabilities of a correct prediction.
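
For example, ECE can be estimated from the softmax outputs roughly like this (a minimal sketch with 10 equal-width confidence bins; the helper name is mine, not from any particular library):

```python
import torch

def expected_calibration_error(probs, labels, n_bins=10):
    # probs: (N, C) softmax outputs, labels: (N,) ground-truth class indices
    confidences, predictions = probs.max(dim=1)
    accuracies = predictions.eq(labels).float()
    ece = torch.zeros(1)
    bin_edges = torch.linspace(0, 1, n_bins + 1)
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # |accuracy - confidence| gap, weighted by the fraction of samples in the bin
            gap = (accuracies[in_bin].mean() - confidences[in_bin].mean()).abs()
            ece += gap * in_bin.float().mean()
    return ece.item()
```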

Implicit model calibration:

By artificially softening the targets, label smoothing prevents the network from becoming overconfident. But does it improve the calibration of the model by making the confidence of its predictions more accurately represent their accuracy? In this section, we seek to answer this question. Guo et al. [15] have shown that modern neural networks are poorly calibrated and over-confident despite having better performance than better calibrated networks from the past. To measure calibration, the authors computed the estimated expected calibration error (ECE). They demonstrated that a simple post-processing step, temperature scaling, can reduce ECE and calibrate the network. Temperature scaling consists in multiplying the logits by a scalar before applying the softmax operator. Here, we show that label smoothing also reduces ECE and can be used to calibrate a network without the need for temperature scaling.
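
In code, those two options boil down to something like this (a minimal sketch; T=2.0 and label_smoothing=0.1 are arbitrary illustrative values, and the label_smoothing argument needs PyTorch 1.10+):

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 12)             # stand-in batch: 4 samples, 12 classes
targets = torch.tensor([8, 9, 10, 11])  # stand-in ground-truth labels

# Temperature scaling: rescale the logits before softmax at inference time.
T = 2.0
calibrated_probs = torch.softmax(logits / T, dim=1)

# Label smoothing: supported directly by PyTorch's cross-entropy loss.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
loss = criterion(logits, targets)
```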

And I think you should check the class probability of each sample, not the class accuracy.
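
Something like this (a minimal sketch; the untrained resnet18 and the random batch are just stand-ins for your trained model and data):

```python
import torch
import torchvision

# Stand-in for the trained classifier with 12 output classes.
model = torchvision.models.resnet18(num_classes=12)
model.eval()

images = torch.randn(4, 3, 32, 32)  # stand-in batch of CIFAR-sized images
with torch.no_grad():
    probs = torch.softmax(model(images), dim=1)

# How the probability mass is split between the artificially similar pairs
# (8 vs 11 and 9 vs 10) for each individual sample.
print(probs[:, [8, 9, 10, 11]])
```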

Thank you, mMagme, you made good points.
I wonder whether network calibration is really needed here; my initial idea was to use a classifier with standard training settings.
I see that it may become overconfident on domains with fuzzy borders because of random factors.
In my original setup, the softmax probabilities for all other domains stay above 0.9, while for the artificially split domains they are 0.5-0.7.
Using special calibration techniques might deviate from my experimentation plan, since I use this classifier for illustrative purposes.
Anyway, I will try label smoothing if it's easy to apply.