I have a network (it can be VGG, ResNet, or DenseNet) whose head/final layer is split into two sibling layers, each of size equal to the number of classes. One layer outputs logits (pre-softmax) while the other outputs a per-class noise term. In simple terms, my loss function is cross entropy over the element-wise sum of the two layers' outputs.
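To make the setup concrete, here is a minimal sketch of what I mean (names like `TwoHeadModel`, `logit_head`, and `noise_head` are just placeholders; `backbone` is any trunk producing `feat_dim` features):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadModel(nn.Module):
    """Backbone (e.g. a VGG/ResNet/DenseNet trunk) with two sibling heads."""
    def __init__(self, backbone, feat_dim, num_classes):
        super().__init__()
        self.backbone = backbone                              # feature extractor
        self.logit_head = nn.Linear(feat_dim, num_classes)    # raw logits
        self.noise_head = nn.Linear(feat_dim, num_classes)    # per-class noise

    def forward(self, x):
        feats = self.backbone(x)
        return self.logit_head(feats), self.noise_head(feats)

def loss_fn(logits, noise, targets):
    # cross entropy over the element-wise sum of the two heads
    return F.cross_entropy(logits + noise, targets)
```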
For this, should I extend autograd and implement forward and backward separately for each sibling layer, or can this be done some other way?
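For reference, by "extending autograd" I mean a custom `torch.autograd.Function` along these lines (a hypothetical sketch of just the sum op, not my actual code):

```python
import torch

class SumHeads(torch.autograd.Function):
    @staticmethod
    def forward(ctx, logits, noise):
        # nothing needs saving: the gradient of a sum doesn't depend on the inputs
        return logits + noise

    @staticmethod
    def backward(ctx, grad_output):
        # a sum passes the upstream gradient unchanged to both inputs
        return grad_output, grad_output
```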