Hi, I have two questions about the discriminator.

- Which size is right for the last layer of the discriminator?

```
last_fc = nn.Linear(100,2)
or
last_fc = nn.Linear(100,1)
```

- After last_fc, can sigmoid and softmax be used interchangeably, with no difference?

So the two setups are mathematically equivalent, but you have to pick one and stay consistent: either a 2-unit output with softmax or a 1-unit output with sigmoid. For the equivalence, look here:

```
import torch

torch.manual_seed(0)
weight_softmax = torch.randn(2, 100)
bias_softmax = torch.randn(2)
inp = torch.randn(1, 100) / 50

# Two-logit linear layer + softmax: probability of the "real" class (index 0).
probs_real_1 = torch.softmax(torch.nn.functional.linear(inp, weight_softmax, bias_softmax), dim=1)[:, 0]
# One-logit linear layer + sigmoid, using the difference of the two rows
# as weight and bias: yields the same probability.
probs_real_2 = torch.sigmoid(torch.nn.functional.linear(inp, weight_softmax[:1] - weight_softmax[1:], bias_softmax[:1] - bias_softmax[1:]))[:, 0]
print(probs_real_1, probs_real_2)  # identical up to floating-point error
```

Of course, you can also carry out the computation with variables, pencil, and paper to get a proof. Because only the difference of the two logits enters the two-class softmax, the per-example gradient over the two logits sums to zero, so the two setups are equivalent in computation and, as a corollary, in gradients.
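A sketch of that pencil-and-paper argument, writing \(z = Wx + b\) for the two logits (notation assumed here, not from the snippet above):

```
\operatorname{softmax}(z)_0
  = \frac{e^{z_0}}{e^{z_0} + e^{z_1}}
  = \frac{1}{1 + e^{-(z_0 - z_1)}}
  = \sigma(z_0 - z_1)
```

so a single-logit layer with weight \(W_0 - W_1\) and bias \(b_0 - b_1\) followed by a sigmoid produces exactly the softmax probability of class 0.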

Given that the relation is linear, one would expect the equivalence to broadly hold. However, anything that takes norms of the weights, such as spectral normalization or LAMB-style optimizers, will behave in a (slightly) different way.
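A minimal sketch of why norm-based methods break the exact equivalence (the seed and shapes are arbitrary; `torch.linalg.matrix_norm` with `ord=2` computes the spectral norm, i.e. the largest singular value):

```
import torch

torch.manual_seed(0)
W = torch.randn(2, 100)   # two-logit weight matrix
w = W[:1] - W[1:]         # equivalent single-logit weight (row difference)

# Spectral norm of each parametrization.
sn_two = torch.linalg.matrix_norm(W, ord=2)
sn_one = torch.linalg.matrix_norm(w, ord=2)

# The two norms generally differ, so after spectral normalization the
# row difference of W / sn_two no longer equals w / sn_one, and the
# normalized networks compute different functions.
print(sn_two.item(), sn_one.item())
```

The same reasoning applies to any optimizer or regularizer that looks at per-layer weight norms: the two-row and one-row parametrizations have different norms, so they get scaled differently.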

Best regards

Thomas