@albanD, just one final clarification: would it also be equivalent to write:
```python
grad_target = (output_cl * label)
grad_target.backward(gradient=label * output_cl, retain_graph=True)
```
As far as I understand, calling backward on the sum of output_cl * label with respect to the parameters is equivalent to calling backward on each element-wise product separately, i.e. minimizing each product individually, and each product is minimal if and only if the sum is minimal. Am I wrong?
Is it equivalent? And if it is not, what is the meaning of this gradient computation compared to the first one?
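For reference, here is a minimal sketch of the two variants I am comparing; output_cl and label are just random placeholder tensors here (not my real model outputs), so this only illustrates the comparison I have in mind:

```python
import torch

# Placeholder tensors standing in for my model output and labels
output_cl = torch.randn(4, 3, requires_grad=True)
label = torch.randn(4, 3)  # labels, no gradient needed

# Variant 1: backward on the sum of output_cl * label
# (same as passing gradient=torch.ones_like(output_cl))
(output_cl * label).sum().backward()
grad_sum = output_cl.grad.clone()
output_cl.grad = None

# Variant 2: backward with gradient=label * output_cl, as in my snippet above
grad_target = output_cl * label
grad_target.backward(gradient=label * output_cl, retain_graph=True)
grad_vec = output_cl.grad.clone()

# Check whether the two variants give the same gradient on output_cl
print(torch.allclose(grad_sum, grad_vec))
```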
I'd appreciate your answer.
Thanks.