How to compute pairwise cosine similarity using nn.CosineSimilarity

I am using torch.bmm to compute the pairwise cosine similarity between two tensors of shape BxDxN. It returns a full NxN matrix for each batch element, whereas nn.CosineSimilarity returns only a vector containing the diagonal of that matrix. How can I use nn.CosineSimilarity to get the full cosine matrix, as torch.bmm does? I cannot use torch.bmm because of a CUDA memory error. This is my code:

import torch
import torch.nn as nn

# Two batches of shape B x D x N
input1 = torch.randn(2, 4, 4)
input2 = torch.randn(2, 4, 4)

# Using bmm: L2-normalize along the feature dim D, then multiply (B x N x D) by (B x D x N)
x_norm = input1 / torch.norm(input1, p=2, dim=1, keepdim=True)
y_norm = input2 / torch.norm(input2, p=2, dim=1, keepdim=True)
cosine_sim = torch.bmm(x_norm.transpose(2, 1), y_norm)
print('Using bmm: \n', cosine_sim)

# PyTorch CosineSimilarity: compares matching columns only, giving a B x N result
cos = nn.CosineSimilarity(dim=1, eps=1e-6)
cosine_sim = cos(input1, input2)
print('Using nn.CosineSimilarity: \n', cosine_sim)

The output is

Using bmm: 
 tensor([[[-0.0230,  0.2983,  0.0487,  0.3974],
         [-0.5747,  0.5513, -0.6436, -0.1389],
         [-0.3876, -0.2107,  0.7093, -0.4929],
         [-0.3446, -0.5347,  0.6372, -0.6423]],

        [[-0.3842, -0.0349,  0.1621,  0.6400],
         [ 0.6776, -0.4812, -0.3169, -0.7976],
         [-0.5251, -0.1258,  0.9381, -0.2379],
         [-0.1517,  0.7164,  0.8332,  0.1668]]])
Using nn.CosineSimilarity: 
 tensor([[-0.0230,  0.5513,  0.7093, -0.6423],
        [-0.3842, -0.4812,  0.9381,  0.1668]])
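
For reference, here is a minimal sketch of one direction that seems to reproduce the bmm result with nn.CosineSimilarity. It assumes a PyTorch version in which nn.CosineSimilarity broadcasts its two inputs: adding a singleton axis to each tensor lets every column of input1 be compared with every column of input2. The helper name pairwise_cosine and the chunk_size parameter are made up for illustration and are not part of PyTorch. Broadcasting may allocate an intermediate of roughly B x D x chunk x N elements, so the sketch processes the columns of the first tensor in chunks to bound peak memory. Whether this actually avoids the CUDA memory error depends on the sizes involved and has not been verified here.

```python
import torch
import torch.nn as nn


def pairwise_cosine(x, y, chunk_size=2, eps=1e-6):
    """Hypothetical helper: x and y are (B, D, N) tensors.

    Returns a (B, N, N) tensor where out[b, i, j] is the cosine
    similarity between x[b, :, i] and y[b, :, j].
    """
    cos = nn.CosineSimilarity(dim=1, eps=eps)
    y_exp = y.unsqueeze(2)  # (B, D, 1, N)
    rows = []
    # Process a few columns of x at a time to limit the size of the
    # broadcast intermediate.
    for start in range(0, x.shape[2], chunk_size):
        x_chunk = x[:, :, start:start + chunk_size].unsqueeze(3)  # (B, D, c, 1)
        rows.append(cos(x_chunk, y_exp))  # broadcast, reduce over D -> (B, c, N)
    return torch.cat(rows, dim=1)  # (B, N, N)


torch.manual_seed(0)
input1 = torch.randn(2, 4, 4)
input2 = torch.randn(2, 4, 4)

full = pairwise_cosine(input1, input2)

# Reference computed with bmm, as in the code above
x_norm = input1 / torch.norm(input1, p=2, dim=1, keepdim=True)
y_norm = input2 / torch.norm(input2, p=2, dim=1, keepdim=True)
reference = torch.bmm(x_norm.transpose(2, 1), y_norm)

print(full.shape)
print(torch.allclose(full, reference, atol=1e-5))
```

A smaller chunk_size lowers peak memory at the cost of more loop iterations. Setting chunk_size to N computes everything in a single broadcast call.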