def forward(self, input_image):
    # Shape comments are (channels, height, width)
    ##
    out = self.conv_1(input_image)  # 20 * 44 * 44
    out = self.bn_conv_1(out)
    out = self.pooling_1(out)  # 20 * 22 * 22
    out = self.bn_pooling_1(out)
    ##
    out = self.conv_2(out)  # 50 * 16 * 16
    out = self.bn_conv_2(out)
    out = self.pooling_2(out)  # 50 * 8 * 8
    out = self.bn_pooling_2(out)
    ##
    out = self.conv_3(out)  # 500 * 2 * 2
    out = self.bn_conv_3(out)
    out = self.pooling_3(out)  # 500 * 1 * 1
    out = self.bn_pooling_3(out)
    out = F.relu(out)
    out = self.conv_4(out)  # 2 * 1 * 1
    out = F.softmax(out)

My question is:

This is a binary classification model, but the output has two nodes.
(Generally, a binary classification model has a single output node, and the prediction is decided by whether that output is greater or less than 0.5.)

I don’t know whether this matters, but the model has no fully connected layer.

Given the above, how should I set up the loss function?

For a binary classification use case, you could use a single output and a threshold (as you’ve explained) or alternatively you could use a multi-class classification with just two classes, so that each class gets its output neuron. The loss functions for both approaches would be different.
In the first case (single output), you would use e.g. nn.BCEWithLogitsLoss and the output tensor shape should match the target shape.
In the latter case, you would use e.g. nn.CrossEntropyLoss, and the target tensor should contain the class indices in the range [0, nb_classes-1] and omit the “class dimension” (usually the channel dim).
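To make the shape requirements concrete, here is a minimal sketch of both setups. The batch size and the random tensors are placeholders standing in for real model outputs and labels, not part of the original model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size = 4  # hypothetical batch size

# Single-output approach: one logit per sample, float target of the same shape.
logits_single = torch.randn(batch_size, 1)                    # [N, 1]
target_single = torch.randint(0, 2, (batch_size, 1)).float()  # [N, 1], values 0.0 or 1.0
loss_bce = nn.BCEWithLogitsLoss()(logits_single, target_single)

# Two-class approach: one logit per class, integer class indices as target.
logits_two = torch.randn(batch_size, 2)                       # [N, 2]
target_two = torch.randint(0, 2, (batch_size,))               # [N], dtype int64, no class dim
loss_ce = nn.CrossEntropyLoss()(logits_two, target_two)

# Turning logits into predicted classes at inference time.
pred_single = (torch.sigmoid(logits_single) > 0.5).long()     # threshold at 0.5
pred_two = logits_two.argmax(dim=1)                           # index of the larger logit
```

Both criteria return a scalar loss by default, so either one can be passed straight to backward().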

Both approaches expect logits, so you should remove your softmax layer and just pass the last output to the criterion.
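As a small illustration of why the softmax should go: the cross-entropy criterion applies log_softmax internally, so feeding it probabilities normalizes the scores a second time. The logits and targets below are made-up values chosen only for the demonstration.

```python
import torch
import torch.nn.functional as F

# Hypothetical raw model outputs, shape [batch, classes], and class indices.
logits = torch.tensor([[2.0, -1.0],
                       [0.5, 1.5],
                       [-1.0, 3.0],
                       [1.0, 0.0]])
target = torch.tensor([0, 1, 1, 0])

# Cross entropy on raw logits ...
loss_from_logits = F.cross_entropy(logits, target)

# ... matches log_softmax followed by the negative log likelihood loss.
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), target)

# Applying softmax first means the scores are normalized twice,
# which yields a different (distorted) loss value.
loss_double_softmax = F.cross_entropy(F.softmax(logits, dim=1), target)
```

If probabilities are needed for reporting, softmax can still be applied to the logits outside the loss computation.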

A final linear layer is not strictly necessary, if you make sure to work with the right shapes of your output and target.
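For the model above, whose last conv layer yields 2 * 1 * 1 per sample, the shapes could be handled in either of the two ways sketched here. The random tensor is only a stand-in for the conv_4 output, and the batch size is an assumption.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size = 4  # hypothetical batch size
out = torch.randn(batch_size, 2, 1, 1)       # stand-in for the conv_4 output, [N, 2, 1, 1]
target = torch.randint(0, 2, (batch_size,))  # class indices, [N]
criterion = nn.CrossEntropyLoss()

# Option 1: flatten the trailing 1 * 1 spatial dims so the output is [N, 2].
loss_flat = criterion(out.flatten(1), target)

# Option 2: keep the spatial dims and give the target matching dims, [N, 1, 1].
loss_spatial = criterion(out, target.view(batch_size, 1, 1))
```

Both options compute the same value here; flattening is usually the simpler choice when the spatial size is always 1 * 1.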

In the latter case, you would use e.g. nn.CrossEntropyLoss, and the target tensor should contain the class indices in the range [0, nb_classes-1] and omit the “class dimension” (usually the channel dim).

I got it.

Both approaches expect logits, so you should remove your softmax layer and just pass the last output to the criterion.

Thanks for your suggestion. “So you should remove your softmax layer and just pass the last output to the criterion” is exactly the point I was missing.

In addition, I have a follow-up question:

Both approaches expect logits, so you should remove your softmax layer and just pass the last output to the criterion.

The input is expected to contain raw, unnormalized scores for each class.

But why is that? As I understand it, the inputs of CrossEntropyLoss are:
(1) the prediction (the output of the model); (2) the label.
And the prediction should be converted to probabilities, which means I would need to place a softmax layer at the end of the model.

Is my understanding wrong that the input of CrossEntropyLoss should be probabilities?
(In other words, is passing the raw model output directly to CrossEntropyLoss mathematically correct?)

Or does the CrossEntropyLoss function in PyTorch do that automatically, i.e. compute the probabilities from the model's predictions itself?