Custom Loss function for a network

I have a network (it can be VGG, ResNet, or DenseNet) whose head/final layer is split into two sibling layers. Both layers have size equal to the number of classes. One layer outputs logits (pre-softmax) while the other outputs a noise value for each class. In simple terms, my loss function is cross entropy over the element-wise sum of the two layers' outputs.
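The two-headed setup described above can be sketched like this (a minimal illustration, not the poster's actual code — the class and layer names are made up, and a single linear layer stands in for the VGG/ResNet/DenseNet backbone):

```python
import torch
import torch.nn as nn

class TwoHeadNet(nn.Module):
    """Backbone with two sibling heads of size num_classes:
    one for logits, one for per-class noise (illustrative names)."""

    def __init__(self, feat_dim=512, num_classes=10):
        super().__init__()
        # Stand-in for the real feature extractor (VGG/ResNet/DenseNet).
        self.backbone = nn.Linear(feat_dim, feat_dim)
        self.logit_head = nn.Linear(feat_dim, num_classes)
        self.noise_head = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        h = torch.relu(self.backbone(x))
        # Return both head outputs; the loss combines them later.
        return self.logit_head(h), self.noise_head(h)
```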

For this, should I extend autograd and implement forward and backward separately for each sibling layer, or can something else be done?

There’s no need to implement the backward yourself. Just compute the loss and call loss.backward().

Are you sure?

The network produces two different outputs. I do an operation combining the two outputs in the loss, not in the forward pass.
How will that work out with backward then?

For clarification, I am trying to do the following: [image of the loss formulation]

As you can see, here y and sigma are network outputs.
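The image itself isn't reproduced in the thread, so the exact formula is unknown. One common formulation along these lines (loss attenuation in the aleatoric-uncertainty literature) perturbs the logits y with Gaussian noise scaled by sigma before taking cross entropy. A hedged sketch under that assumption — the names, shapes, and the sampling step are guesses, not confirmed by the thread:

```python
import torch
import torch.nn.functional as F

# y and sigma stand in for the two head outputs (assumed names/shapes).
y = torch.randn(4, 10, requires_grad=True)      # logits head
sigma = torch.rand(4, 10, requires_grad=True)   # per-class noise head
target = torch.randint(0, 10, (4,))

eps = torch.randn_like(y)                       # one Gaussian noise sample
# Cross entropy over the noise-perturbed logits; autograd handles the rest.
loss = F.cross_entropy(y + sigma * eps, target)
loss.backward()                                 # gradients reach y and sigma
```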

Ref: Image from

Just write your loss in terms of autograd operations and call backward(). You don’t need to do anything special like writing your own autograd.Function with a custom backward.
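Concretely, for the loss as originally stated (cross entropy over the element-wise sum of the two head outputs), that advice looks like this — `logits` and `noise` are illustrative names for the two sibling outputs:

```python
import torch
import torch.nn.functional as F

# Stand-ins for the two head outputs; in practice they come from the model.
logits = torch.randn(4, 10, requires_grad=True)
noise = torch.randn(4, 10, requires_grad=True)
target = torch.randint(0, 10, (4,))

# Cross entropy over the element-wise sum. Because the sum is an autograd
# operation, backward() propagates gradients into both heads automatically.
loss = F.cross_entropy(logits + noise, target)
loss.backward()
# logits.grad and noise.grad are now populated -- no custom Function needed.
```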

That worked out just fine. I just needed to dig deeper into the PyTorch documentation. Thanks.