Hello @KFrank!

z is the binary representation of the class with m bits. As an example, for m = 3, z runs from 000 to 111. So z has a particular structure in that it is the exhaustive list of combinations of 1s and 0s.
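To make the encoding concrete, here is a minimal sketch (my own illustration, not code from the thread) that turns an integer class label into its m-bit vector z, which could then serve as the target for a per-bit loss:

```python
# Hypothetical sketch: encode an integer class label as its m-bit binary
# representation (the "z" described above), e.g. as float targets for a
# per-bit loss such as nn.BCEWithLogitsLoss in PyTorch.
def class_to_bits(label: int, m: int) -> list:
    """Return the m-bit binary expansion of `label`, most-significant bit first."""
    return [float((label >> k) & 1) for k in range(m - 1, -1, -1)]

print(class_to_bits(5, 3))  # -> [1.0, 0.0, 1.0], i.e. binary 101
print(class_to_bits(0, 3))  # -> [0.0, 0.0, 0.0]
```

For m = 3 the eight labels 0..7 map exactly onto the eight vectors 000..111, which is the exhaustive-combinations structure described above.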
I have two questions here. One, for this to work, should the activation function at the end of my network be nn.Sigmoid? (I checked the documentation and found that the sigmoid is applied internally in the loss function.) And two, is this loss function optimal for this setting? What I mean is: does it lead to a trained network that performs as well as one trained with ||y - ax||^2 as the loss? I apologize if my questions do not make sense.
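As a sanity check on the parenthetical above, here is a small numeric sketch (my own, using the standard stable formula rather than PyTorch itself) showing that a logits-based binary cross-entropy, as in nn.BCEWithLogitsLoss, applies the sigmoid internally and agrees with an explicit sigmoid followed by plain BCE:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(p, t):
    # plain binary cross-entropy on a probability p in (0, 1)
    return -(t * math.log(p) + (1 - t) * math.log(1 - p))

def bce_with_logits(z, t):
    # numerically stable logits form: max(z, 0) - z*t + log(1 + exp(-|z|))
    return max(z, 0.0) - z * t + math.log(1.0 + math.exp(-abs(z)))

z, t = 0.7, 1.0
print(abs(bce(sigmoid(z), t) - bce_with_logits(z, t)))  # -> ~0.0
```

So if the loss already takes raw logits, the network itself should end without an nn.Sigmoid layer.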
The relation y = ax + b can, in theory, produce an infinite number of training samples, because there are infinitely many values of a and b for a fixed distribution with fixed statistics.
Thank you so much, Frank, for taking the time to help me understand how these loss functions work. I feel I am really close to the correct architecture for this problem. Appreciate it.