Mask R-CNN: bbox and segmentation good, but classification very bad

I trained the model from scratch; no pretrained weights were used. The dataset contains 2 classes with 10k grayscale images each, but the images are very small (24x40). I embedded them into larger, randomly created images as well as into other larger grayscale images (512x640), plus augmentation with affine transformations.
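
Roughly, the embedding works like the sketch below (simplified; the real pipeline also pastes into real 512x640 background images and applies the affine augmentation, and the function name and random background here are just illustrative):

```python
import numpy as np

def embed_object(small, canvas_h=512, canvas_w=640, rng=None):
    """Paste a small grayscale crop (e.g. 24x40) at a random position inside a
    larger grayscale canvas; return the canvas and the resulting bounding box."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = small.shape
    canvas = rng.integers(0, 256, size=(canvas_h, canvas_w), dtype=np.uint8)  # random background
    y0 = int(rng.integers(0, canvas_h - h))
    x0 = int(rng.integers(0, canvas_w - w))
    canvas[y0:y0 + h, x0:x0 + w] = small
    return canvas, (x0, y0, x0 + w, y0 + h)  # box as (x1, y1, x2, y2)
```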
When I test the model on such images, again embedded randomly into other images, the bounding boxes and segmentation masks are very good, but the classification score is around 0.5 for both classes, with a tendency towards the correct class.
How can I improve the classification accuracy?

import torch

optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)
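
These two lines sit inside the usual torchvision detection training loop; a condensed sketch of it is below (`model`, `data_loader`, `device` and `num_epochs` come from my setup and are not shown here):

```python
for epoch in range(num_epochs):
    model.train()
    for images, targets in data_loader:
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        loss_dict = model(images, targets)  # MaskRCNN in train mode returns a dict of losses
        losses = sum(loss for loss in loss_dict.values())
        optimizer.zero_grad()
        losses.backward()
        optimizer.step()
    lr_scheduler.step()
```

The loss dict also contains `loss_classifier` as a separate entry, so the classification head can be monitored on its own.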

Or are the objects too small, so that the model architecture needs to be adapted?

At the moment I have only adapted the anchor sizes to

anchor_sizes = ((8,), (16,), (32,), (64,), (128,), )
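
In case it matters, the anchors are plugged into the model roughly like this (a sketch assuming the standard torchvision builder on a recent version; older versions take `pretrained=False` instead of `weights=None`, and `num_classes=3` means 2 object classes plus background):

```python
from torchvision.models.detection import maskrcnn_resnet50_fpn
from torchvision.models.detection.rpn import AnchorGenerator

anchor_sizes = ((8,), (16,), (32,), (64,), (128,),)
aspect_ratios = ((0.5, 1.0, 2.0),) * len(anchor_sizes)  # one tuple per FPN level

model = maskrcnn_resnet50_fpn(
    weights=None,                 # trained from scratch, no pretrained weights
    num_classes=3,                # 2 object classes + background
    rpn_anchor_generator=AnchorGenerator(sizes=anchor_sizes, aspect_ratios=aspect_ratios),
)
```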

Even when I run this for 10 epochs up to 10 times, effectively 100 epochs in total, the situation does not change.