Hi,
I’m currently working on a semantic segmentation problem where I want to classify every pixel of my 256×256 input image into one of 256 classes. I currently use CrossEntropyLoss and it works OK.

In my specific problem, the 0–255 class numbers also have the property that confusing class 5 with class 6, for instance, is not as “bad” as confusing 5 with 200; that is, mistaking “close” classes matters less than mistaking “far” ones. So I thought of adding a second loss to my system, an L2 loss, i.e. MSELoss.

However, for an input/label of size (Batch × 256 × 256), the output of my network is (Batch × 256 × 256 × 256), so I can’t use MSELoss(out, label) directly. Also, as I understand it, taking the argmax over the class dimension and then applying MSELoss would make the loss non-differentiable.

Is there any way to get around this?
Thanks in advance…

One way of incorporating an underlying metric into a distance between probability measures is to use the Wasserstein distance as the loss. (Cross-entropy loss is the KL divergence, not quite a distance but almost, between the predicted probabilities and the one-hot distribution given by the labels.) A PyTorch implementation and a link to Frogner et al.’s paper are linked below.
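Since your 256 classes lie on a line, the general Wasserstein machinery simplifies: the 1-D Wasserstein-1 distance between two distributions is just the L1 distance between their CDFs, which is cheap and fully differentiable. A minimal sketch (my assumptions: logits of shape (B, C, H, W) and integer labels of shape (B, H, W); this is not the Frogner et al. implementation, just the 1-D special case):

```python
import torch
import torch.nn.functional as F

def wasserstein1d_loss(logits, target, num_classes=256):
    """1-D Wasserstein-1 loss for ordinal classes laid out on a line.

    logits: (B, C, H, W) raw network outputs
    target: (B, H, W) integer class labels in [0, num_classes - 1]
    """
    probs = F.softmax(logits, dim=1)  # per-pixel predicted distribution over classes
    # One-hot target distribution, moved to (B, C, H, W) layout
    target_onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    # W1 in 1-D = L1 distance between the CDFs along the class axis
    cdf_pred = torch.cumsum(probs, dim=1)
    cdf_target = torch.cumsum(target_onehot, dim=1)
    return (cdf_pred - cdf_target).abs().sum(dim=1).mean()
```

If the prediction puts all its mass on class k and the label is j, this loss is exactly |k − j|, which is the “close mistakes cost less” behaviour the question asks for, while staying differentiable because no argmax is involved.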

An alternative could be to use the expected squared error loss per pixel. EDIT: The notebook doesn’t solve the per-pixel problem. For that you might use a Rubner-style approach (http://robotics.stanford.edu/~rubner/papers/rubnerIjcv00.pdf) to treat the label as a third dimension, or you might just use MSE…
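The expected squared error idea can be written directly: instead of taking an argmax, take the expectation of (class − label)² under the predicted softmax distribution, which keeps everything differentiable. A sketch under the same shape assumptions as above (hypothetical helper, not from the linked notebook):

```python
import torch
import torch.nn.functional as F

def expected_sq_error(logits, target):
    """Expected squared distance to the true label, per pixel.

    logits: (B, C, H, W) raw network outputs
    target: (B, H, W) integer class labels in [0, C - 1]
    """
    B, C, H, W = logits.shape
    probs = F.softmax(logits, dim=1)  # per-pixel distribution over the C classes
    # Class indices broadcast against the class dimension
    classes = torch.arange(C, dtype=probs.dtype, device=probs.device).view(1, C, 1, 1)
    # Squared distance from each class index to the true label: (B, C, H, W)
    sq_dist = (classes - target.unsqueeze(1).to(probs.dtype)) ** 2
    # E[(class - label)^2] under the predicted distribution, averaged over pixels
    return (probs * sq_dist).sum(dim=1).mean()
```

This can be added on top of CrossEntropyLoss with a weighting factor; note it penalises any spread in the predicted distribution, not just the location of its peak.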
Best regards