For binary classification you need only one logit, so a linear layer that maps its input to a single neuron is adequate. You then need to put a threshold on the logit output by the linear layer. Alternatively, an activation such as sigmoid as the last layer makes the output easier to interpret as a probability.
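A minimal sketch of that idea (the layer sizes here are illustrative, not from your model):

```python
import torch
import torch.nn as nn

# Hypothetical binary classifier: the final Linear layer maps the
# features down to a single logit (one output neuron).
model = nn.Sequential(
    nn.Linear(128, 64),   # 128 input features is an assumption
    nn.ReLU(),
    nn.Linear(64, 1),     # single logit for binary classification
)

x = torch.randn(8, 128)          # a batch of 8 feature vectors
logits = model(x)                # shape: (8, 1)
probs = torch.sigmoid(logits)    # optional: probabilities in (0, 1)
```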
Thanks! Would you mind giving me an example with code? I've never done it.
What loss do you recommend? I'm currently using CrossEntropy.
I have one concern regarding the use of a single output: I'm worried that if the model only sees the class I'm trying to predict, it will predict classes 0, 2, 3, 4 as class 1 later on.
If you want to define your model from scratch, use this tutorial. But based on your sample code, it seems you are using transfer learning. Here is the tutorial for transfer learning.
Why should the model see only class 1 samples? In your first post you mentioned that you are using two separate folders: one for class 1 and the other for classes 0, 2, 3, 4. So the model will see all samples and learn class 1 as 1 and classes 0, 2, 3, 4 as not-1, i.e. 0.
You can also define a binary model for each separate class and combine them. This approach is called One-vs-All.
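A rough sketch of the one-vs-all idea (illustrative only: in practice each of the five single-logit models would be trained separately on "its class vs. the rest"):

```python
import torch
import torch.nn as nn

# Hypothetical: one single-logit binary model per class (5 classes here).
num_classes = 5
binary_models = [nn.Linear(16, 1) for _ in range(num_classes)]

x = torch.randn(4, 16)  # a batch of 4 samples, 16 features each

# Stack the per-class logits side by side and pick the most
# confident class for each sample.
logits = torch.cat([m(x) for m in binary_models], dim=1)  # shape: (4, 5)
predicted_class = logits.argmax(dim=1)                    # shape: (4,)
```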
Just to clarify something, for a binary-classification problem, you
are best off using the logits that come out of a final Linear layer,
with no threshold or Sigmoid activation, and feed them into BCEWithLogitsLoss. (Using Sigmoid and BCELoss is less
numerically stable.)
And, as Doosti recommended, your last layer should have a single
output, rather than 2. Thus:
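Something along these lines (a sketch; the feature size and batch are made up for illustration):

```python
import torch
import torch.nn as nn

# Last layer has a single output, and the raw logits are fed
# directly into BCEWithLogitsLoss (which applies the log-sigmoid
# internally in a numerically stable way).
final_layer = nn.Linear(32, 1)        # 32 input features is an assumption
loss_fn = nn.BCEWithLogitsLoss()

features = torch.randn(8, 32)
logits = final_layer(features).squeeze(1)   # shape: (8,)
labels = torch.randint(0, 2, (8,)).float()  # binary targets, 0.0 or 1.0

loss = loss_fn(logits, labels)
loss.backward()
```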
The short answer is that you threshold your single logit output
against 0.0, rather than running a set of nClass outputs through argmax().
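The contrast between the two cases, in a toy example (the logit values are made up):

```python
import torch

# Multi-class: argmax over a set of nClass logits
multi_logits = torch.tensor([[0.2, 1.5, -0.3]])
multi_pred = multi_logits.argmax(dim=1)        # tensor([1])

# Binary: threshold a single logit against 0.0
single_logit = torch.tensor([-1.2, 0.3, 2.0, -0.1])
binary_pred = (single_logit > 0.0).long()      # tensor([0, 1, 1, 0])
```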
Let me confirm what I think you are asking:
In addition to calculating your loss function (used for training), you
often also want to calculate the accuracy of your predictions.
For a multi-class classification problem, you typically pass a set
of nClass predicted logits (or predicted probabilities) through argmax() to get the single predicted integer class label (that
you then compare with your known class label). I assume that
this is the "argmax" you are talking about.
For a binary problem, your last Linear layer will output a single
predicted logit for the sample being in class-"1" (as opposed to
being in class-"0"). (Or, if you pass this logit through a sigmoid(),
you will get the predicted probability of the sample being in class-"1".)
In this case you threshold the output to get a binary prediction: logit > 0.0 == True means you predict that the sample is
in class-"1" (and logit > 0.0 == False means class-"0"). (If
you are working with probabilities, then prob > 0.5 == True
means class-"1".) You then compare this prediction with the known
class-"0" / class-"1" binary label for the sample in question.
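Putting that together, accuracy can be computed like this (the logit and label values are invented for illustration):

```python
import torch

# Hypothetical predicted logits and known binary labels
logits = torch.tensor([0.8, -0.5, 1.3, -2.0, 0.1])
labels = torch.tensor([1, 0, 1, 1, 0])

# Thresholding the logit at 0.0 is equivalent to sigmoid(logit) > 0.5
preds = (logits > 0.0).long()
accuracy = (preds == labels).float().mean()   # fraction of correct predictions
```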