Actually, I found out that the problem was my custom Siamese network, not my loss function. I want to take a pretrained VGG Face model and continue training it on my dataset. My Siamese network looks like this:
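import torch.nn as nn

class SiameseNetwork(nn.Module):
    def __init__(self, vgg_model):
        super(SiameseNetwork, self).__init__()
        # One shared backbone, so both branches use the same weights
        self.vgg = vgg_model

    def forward(self, x0, x1):
        # Embed each image of the pair with the shared backbone
        out1 = self.vgg(x0)
        out2 = self.vgg(x1)
        return out1, out2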
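A quick way to rule out the wrapper itself is to run it with a small stand-in backbone. This is only a sketch: the `nn.Sequential` backbone and the 8x8 random inputs below are hypothetical placeholders, not the VGG Face model or my real images.

```python
import torch
import torch.nn as nn

class SiameseNetwork(nn.Module):
    def __init__(self, backbone):
        super().__init__()
        self.vgg = backbone  # shared by both branches

    def forward(self, x0, x1):
        return self.vgg(x0), self.vgg(x1)

# Hypothetical tiny backbone standing in for the pretrained model
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 16))
net = SiameseNetwork(backbone)

x0 = torch.rand(4, 3, 8, 8)
x1 = torch.rand(4, 3, 8, 8)
out1, out2 = net(x0, x1)

# Feeding the same batch to both branches should give identical embeddings
same1, same2 = net(x0, x0)
print(out1.shape, torch.equal(same1, same2))
```

If the wrapper behaves with a stand-in like this, the NaN values are more likely to come from the backbone or the training setup than from the Siamese class.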
I also found that the NaN values come from the output of my network, and I am now trying to work out why the network produces them. I checked the input for anomalies but found nothing abnormal. Can you give me some advice? I would appreciate it.
[[0.9608, 0.9569, 0.9451,  ..., 0.1098, 0.1059, 0.1059],
          [0.9569, 0.9490, 0.9333,  ..., 0.1098, 0.1098, 0.1098],
          [0.9451, 0.9333, 0.9098,  ..., 0.1137, 0.1137, 0.1137],
          ...,
          [0.9529, 0.9529, 0.9490,  ..., 0.4235, 0.4275, 0.4275],
          [0.9490, 0.9490, 0.9529,  ..., 0.4235, 0.4275, 0.4275],
          [0.9490, 0.9490, 0.9529,  ..., 0.4235, 0.4275, 0.4275]],
         [[0.8941, 0.8902, 0.8784,  ..., 0.0902, 0.0863, 0.0863],
          [0.8902, 0.8863, 0.8706,  ..., 0.0902, 0.0863, 0.0863],
          [0.8863, 0.8745, 0.8471,  ..., 0.0941, 0.0902, 0.0902],
          ...,
          [0.9569, 0.9569, 0.9529,  ..., 0.2902, 0.2902, 0.2902],
          [0.9529, 0.9529, 0.9529,  ..., 0.2863, 0.2902, 0.2902],
          [0.9529, 0.9529, 0.9529,  ..., 0.2863, 0.2902, 0.2902]]]],
       device='cuda:0')
----------------------------------------------------------
output: tensor([[-0.0138, -0.0085,  0.0077,  ..., -0.0088, -0.0003, -0.0021],
        [    nan,     nan,     nan,  ...,     nan,     nan,     nan],
        [ 0.0359,  0.0134,  0.0074,  ...,  0.0280,  0.0116,  0.0102],
        ...,
        [    nan,     nan,     nan,  ...,     nan,     nan,     nan],
        [ 0.0099,  0.0099, -0.0156,  ...,  0.0109, -0.0009,  0.0184],
        [-0.0121, -0.0051,  0.0370,  ...,  0.0406,  0.0065, -0.0012]],
       device='cuda:0', dtype=torch.float16, grad_fn=<AddmmBackward>) tensor([[    nan,     nan,     nan,  ...,     nan,     nan,     nan],
        [-0.0158, -0.0266,  0.0180,  ..., -0.0115,  0.0026, -0.0345],
        [    nan,     nan,     nan,  ...,     nan,     nan,     nan],
        ...,
        [ 0.0249,  0.0007, -0.0102,  ..., -0.0132,  0.0214,  0.0118],
        [ 0.0028,  0.0037,  0.0042,  ...,  0.0135,  0.0115, -0.0005],
        [ 0.0220,  0.0144,  0.0100,  ...,  0.0045,  0.0385, -0.0046]],
       device='cuda:0', dtype=torch.float16, grad_fn=<AddmmBackward>)
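One way to narrow this down is to register forward hooks and record which layers first produce non-finite values. The sketch below is a minimal version under stated assumptions: `find_nonfinite_modules` is a hypothetical helper name, and the toy model with a `Sqrt` layer only exists to trigger a NaN on purpose. Since the outputs above are `torch.float16`, overflow past the float16 range (maximum about 65504) is one possible cause worth checking, though nothing here confirms it.

```python
import torch
import torch.nn as nn

def find_nonfinite_modules(model, *inputs):
    """Run one forward pass and return the names of leaf modules whose
    output contains NaN or Inf, in the order they were executed."""
    bad, handles = [], []

    def make_hook(name):
        def hook(module, inp, out):
            if isinstance(out, torch.Tensor) and not torch.isfinite(out).all():
                bad.append(name)
        return hook

    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # hook leaf modules only
            handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            model(*inputs)
    finally:
        for h in handles:
            h.remove()  # always clean up the hooks
    return bad

class Sqrt(nn.Module):
    """Toy layer: sqrt of a negative number gives NaN."""
    def forward(self, x):
        return torch.sqrt(x)

first = nn.Linear(4, 4)
with torch.no_grad():
    first.weight.copy_(-torch.eye(4))  # maps ones to minus ones
    first.bias.zero_()

toy = nn.Sequential(first, Sqrt(), nn.Linear(4, 2))
bad = find_nonfinite_modules(toy, torch.ones(1, 4))
print(bad)  # the first entry is the layer where the NaN starts
```

For the Siamese network, the call would be `find_nonfinite_modules(net, x0, x1)` on a batch that is known to produce NaN rows; the first name in the returned list is the place to look.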
Additionally, I load the VGG model like this:
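from torchsummary import summary

# vgg_face_dag and device are defined elsewhere in my code
vgg_model = vgg_face_dag('pretrained/vgg_face_dag.pth').to(device)

# Disabled alternative: freeze every parameter
# for param in vgg_model.parameters():
#     param.requires_grad = False

# Freeze the first 33 child modules and leave the rest trainable
for idx, layer in enumerate(vgg_model.children(), start=1):
    if idx < 34:
        for param in layer.parameters():
            param.requires_grad = False

summary(vgg_model, (3, 224, 224))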
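To confirm that the freezing loop does what is intended, the trainable and frozen parameters can be counted, and only the trainable ones passed to the optimizer. This is a sketch on a toy model: `freeze_first_children` and `count_params` are hypothetical helper names, and the three-layer `nn.Sequential` stands in for the VGG Face model (where `n_frozen=33` would match the loop above).

```python
import torch
import torch.nn as nn

def freeze_first_children(model, n_frozen):
    """Disable gradients for the first n_frozen direct child modules."""
    for idx, layer in enumerate(model.children(), start=1):
        if idx <= n_frozen:
            for p in layer.parameters():
                p.requires_grad = False

def count_params(model):
    """Return (trainable, frozen) parameter counts."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    return trainable, frozen

# Toy stand-in: Linear(8, 4) has 36 parameters, Linear(4, 2) has 10
toy = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))
freeze_first_children(toy, n_frozen=2)
trainable, frozen = count_params(toy)
print(trainable, frozen)

# Give the optimizer only the parameters that still require gradients
optimizer = torch.optim.SGD(
    [p for p in toy.parameters() if p.requires_grad], lr=1e-3
)
```

The learning rate here is an arbitrary placeholder, not a recommendation for fine-tuning.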