Penalizing cosine similarity between kernels

I have a fully convolutional network (like YOLOv3 or SSD).
The last layer uses K 1x1 kernels to produce K predictions at every location of the feature map (i.e., a BxKxHxW tensor).
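For concreteness, such a head could look roughly like this (the sizes below are just an example, not my actual architecture):

import torch
import torch.nn as nn

B, C, H, W = 2, 64, 13, 13            # example batch size and feature-map shape
K = 3                                 # number of per-location predictors
head = nn.Conv2d(C, K, kernel_size=1)
features = torch.randn(B, C, H, W)
preds = head(features)                # shape: (B, K, H, W)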

I want those kernels to be linearly independent, i.e., no predictor should be computable as a linear combination of the others.

For that reason, I want the optimizer to be rewarded for learning kernels that are orthogonal to each other, by adding their pairwise cosine similarity to the loss.

The loss function only receives the network output. How do I add the cosine similarity of the kernels to the loss? Should I implement a special mode where the network includes them in the output of its forward method, so the loss function has access to them?

Thanks

You don’t need to manipulate the forward method; you can add any auxiliary loss to the already computed loss, as seen here:

import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Conv2d(3, 64, 1)
x = torch.randn(1, 3, 24, 24)
target = torch.randint(0, 64, (1, 24, 24))

criterion = nn.CrossEntropyLoss()

output = conv(x)
loss = criterion(output, target)
# conv.weight[i] has shape (in_channels, 1, 1), so compare along dim=0
aux_loss = 1 - F.cosine_similarity(conv.weight[0], conv.weight[1], dim=0)
loss = loss + aux_loss
loss.backward()
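Depending on your use case, you might also want to scale the auxiliary term with a weighting coefficient so it doesn’t dominate the main loss (the name aux_weight and its value here are just placeholders):

aux_weight = 0.1  # hypothetical hyperparameter; tune for your task
loss = loss + aux_weight * aux_loss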

Thank you so much for the comprehensive answer. I am going to use the auxiliary loss approach.

Just note that in my case, aux_loss will be the cosine similarity itself (or, more precisely, its absolute value) rather than 1 - CS. I want the loss to grow as the kernels become more similar to each other.

When two vectors point in the same direction, their cosine similarity is 1, while when they are exactly opposite (still linearly dependent, since one is a negative scalar multiple of the other), it is -1.

>>> c = nn.Conv2d(3, 64, 1)
>>> F.cosine_similarity(c.weight[0], c.weight[0], dim=0)
tensor([[1.0000]], grad_fn=<SumBackward0>)
>>> F.cosine_similarity(c.weight[0], -c.weight[0], dim=0)
tensor([[-1.0000]], grad_fn=<SumBackward1>)

Therefore, the absolute cosine similarity is the loss I’m looking for. For 3 kernels, my code will look like this:

import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Conv2d(64, 3, 1)  # 3 kernels of 1x1 shape, feature dimensionality is 64
aux_loss = 0
# The shape of conv.weight[i] is (64, 1, 1), therefore specifying dim=0
aux_loss += torch.abs(F.cosine_similarity(conv.weight[0], conv.weight[1], dim=0))
aux_loss += torch.abs(F.cosine_similarity(conv.weight[0], conv.weight[2], dim=0))
aux_loss += torch.abs(F.cosine_similarity(conv.weight[1], conv.weight[2], dim=0))

I.e., every pair of kernels is penalized for being linearly dependent, pushing them toward orthogonality.
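For a larger K, writing out every pair by hand gets tedious. Here is a vectorized sketch of the same idea (the helper name pairwise_abs_cosine is mine): flatten each kernel to a vector, L2-normalize, and read all pairwise cosine similarities off the Gram matrix:

import torch
import torch.nn as nn
import torch.nn.functional as F

def pairwise_abs_cosine(conv):
    # Flatten each of the K kernels to a vector of shape (C * kh * kw,)
    w = conv.weight.flatten(1)            # (K, C * kh * kw)
    w = F.normalize(w, dim=1)             # unit-norm rows
    gram = w @ w.t()                      # gram[i, j] = cosine(kernel_i, kernel_j)
    k = gram.size(0)
    off_diag = ~torch.eye(k, dtype=torch.bool, device=gram.device)
    # Every unordered pair appears twice in the Gram matrix, so halve the sum
    return gram[off_diag].abs().sum() / 2

conv = nn.Conv2d(64, 3, 1)
aux_loss = pairwise_abs_cosine(conv)      # same value as the 3-term sum above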