At a technical level, neither loss nor its gradient is defined when margin = 0.
Note that torch.sign(torch.tensor([0.0])) is zero, so the argument to
softplus() diverges (and softplus() itself either diverges or becomes zero,
depending on the sign of z).
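You can check the divergence directly. The following is only a minimal sketch, not the loss from your post: it assumes the argument to softplus() has the form z / torch.sign(margin), and the names z_pos, z_neg, arg_pos, arg_neg, sp_pos, and sp_neg are made up for illustration.

```python
import torch
import torch.nn.functional as F

margin = torch.tensor([0.0])
sign = torch.sign(margin)      # the sign of zero is zero

# Hypothetical stand-in for the softplus() argument: z / sign(margin).
z_pos = torch.tensor([1.0])
z_neg = torch.tensor([-1.0])
arg_pos = z_pos / sign         # division by zero gives +inf
arg_neg = z_neg / sign         # division by zero gives -inf

sp_pos = F.softplus(arg_pos)   # diverges along with its argument
sp_neg = F.softplus(arg_neg)   # becomes zero

print(sign, arg_pos, arg_neg, sp_pos, sp_neg)
```

Which of the two cases you land in depends only on the sign of z, which is the behavior described above.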

You might hope that when margin is very small, but not quite zero, you
can define a consistent gradient by computing the gradient for non-zero
margin and then taking its limit as margin goes to zero. But, as your own
expression for manual_grad shows, this doesn’t work.

It is true that manual_grad evaluates to zero for margin = 0.0, but, in
isolation, this isn’t meaningful. When margin is very small and negative,
manual_grad becomes one (to machine precision), but when margin is very
small and positive, manual_grad underflows to zero. These numerical
computations are telling you that the gradient, as a function of margin, is
discontinuous at margin = 0. The limits from the two sides disagree, so
trying to define the gradient by taking the limit as margin goes to zero
leaves the gradient undefined.
Autograd reasonably gives you nan for this undefined value. (Simply
asserting that the gradient ought to be zero, or constructing some expression
that returns zero, doesn’t change the fact that the gradient isn’t well defined
when margin = 0.0.)
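
To see the two one-sided limits disagree, here is a rough sketch in plain Python. It does not use your manual_grad: stand_in_grad() is a hypothetical function, sigmoid(-1 / margin), chosen only because it shows the same kind of jump at margin = 0.

```python
import math


def stand_in_grad(margin: float) -> float:
    """Hypothetical stand-in gradient: sigmoid(-1 / margin), computed stably."""
    x = -1.0 / margin
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # underflows to 0.0 when x is very negative
    return e / (1.0 + e)


# Approach margin = 0 from both sides and compare the values.
for m in (1e-2, 1e-4, 1e-8):
    print(f"margin = -{m:g}: {stand_in_grad(-m)}    margin = +{m:g}: {stand_in_grad(m)}")
```

As margin shrinks, the value on the negative side goes to one and the value on the positive side goes to zero, so no single number can serve as "the" limit at margin = 0.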