Binary classification using vision transformers

Hello everyone.
I am working on a binary classification project using vision transformers. The full architecture has several other blocks, but the one of interest is the encoder (a vision transformer). The difficulty is that in my dataset, each image can be classified as both 0 and 1, so I am not sure how to train the global model.

Question: why can each image be classified as both 0 and 1?
Answer: to classify an image, I rely on two inputs: the image itself and a vector that I add to it. Each image is paired with both the class-0 vector and the class-1 vector.

In the training loop, I pass the image together with the class-1 vector and its class label (and do the same for the other class). Then I sum the two losses, each weighted by 0.5, and backpropagate the combined loss. Is that correct?
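For concreteness, here is a minimal PyTorch sketch of the training step described above. The model, the vector shapes, and all names (`CondViT`, `vec0`, `vec1`) are assumptions for illustration, not the actual architecture; the point is only the loss combination: two forward passes (one per conditioning vector), losses averaged with weight 0.5 each, and a single backward pass on the combined loss.

```python
import torch
import torch.nn as nn

class CondViT(nn.Module):
    """Stand-in for the real model: an encoder plus a conditioning vector.
    The Linear layers are placeholders for the actual ViT encoder and head."""
    def __init__(self, img_dim=16, vec_dim=4):
        super().__init__()
        self.encoder = nn.Linear(img_dim, 8)   # placeholder for the ViT encoder
        self.head = nn.Linear(8 + vec_dim, 1)  # classification head -> one logit
    def forward(self, img, vec):
        h = torch.relu(self.encoder(img))
        return self.head(torch.cat([h, vec], dim=-1)).squeeze(-1)

model = CondViT()
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

img = torch.randn(2, 16)   # dummy batch standing in for images
vec0 = torch.zeros(2, 4)   # conditioning vector associated with class 0
vec1 = torch.ones(2, 4)    # conditioning vector associated with class 1
y0 = torch.zeros(2)        # label when the class-0 vector is attached
y1 = torch.ones(2)         # label when the class-1 vector is attached

# One training step: forward the same image with each vector, weight the
# two losses by 0.5 each, then backpropagate the combined loss once.
loss = 0.5 * criterion(model(img, vec0), y0) + 0.5 * criterion(model(img, vec1), y1)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Since both loss terms flow into one scalar before `backward()`, the gradients from the two (image, vector) pairs accumulate naturally in a single update, which matches the "sum the two losses * 0.5 and backpropagate" description.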