I have a fully convolutional network (like YOLOv3 or SSD).
The last layer uses K 1x1 kernels to produce K predictions at every location of the feature map (i.e., a BxKxHxW tensor).
I want those kernels to be linearly independent, i.e., no predictor should be computable as a linear combination of the others.
To encourage this, I want to push the kernels toward being orthogonal to each other by adding a penalty on their pairwise cosine similarity to the loss.
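For concreteness, the term I have in mind looks roughly like the sketch below. It assumes PyTorch and a final layer that is an `nn.Conv2d` whose weight has shape (K, C, 1, 1); `cosine_penalty` and `head` are names I made up for illustration, and the channel counts are arbitrary. I square the similarities (my own choice) so that minimizing the term pushes them toward zero rather than toward -1.

```python
import torch
import torch.nn.functional as F


def cosine_penalty(weight: torch.Tensor) -> torch.Tensor:
    """Mean squared pairwise cosine similarity between the K kernels.

    weight: conv weight of shape (K, C, 1, 1), as stored by nn.Conv2d.
    Assumes K >= 2.
    """
    k = weight.shape[0]
    flat = weight.reshape(k, -1)                  # (K, C): one row per kernel
    unit = F.normalize(flat, dim=1)               # scale each row to unit L2 norm
    cos = unit @ unit.t()                         # (K, K) cosine similarities
    off_diag = cos - torch.diag(torch.diag(cos))  # zero out self-similarity
    return off_diag.pow(2).sum() / (k * (k - 1))  # average over ordered pairs


# Hypothetical final layer: 16 input channels, K = 4 predictors.
head = torch.nn.Conv2d(in_channels=16, out_channels=4, kernel_size=1)
penalty = cosine_penalty(head.weight)  # scalar that depends on the weights
```

The result is a scalar that is differentiable with respect to `head.weight`, so it could in principle be summed with the usual detection loss once the loss has access to it.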
The loss function only receives the network output. How do I add the cosine similarity of the kernels to the loss? Should I implement a special mode in which the network's forward method also returns the kernels, so that the loss function has access to them?
Thanks