How to compare the (dis)similarity between images?

I have a dataset that looks like the following:

Image_1    Image_2    Image_3
  A           B         C
 ...         ...       ...

If image A is similar to image B, the sample is assigned label 1; otherwise it is assigned label 0.

I first use a pre-trained ResNet-18 to extract features from each RGB image, which gives a 1000-dimensional vector per image. I then build a network on top of these features and train it with a triplet loss. Here is part of my code:

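import torch

class Network(torch.nn.Module):
    # n_hidden_1 has no default, so it must come before the defaulted parameters.
    def __init__(self, n_hidden_1, n_feature=1000, n_output=10):
        super(Network, self).__init__()
        # `layers` is a placeholder attribute name.
        self.layers = torch.nn.Sequential(
            torch.nn.Linear(n_feature, n_hidden_1),
            torch.nn.Linear(n_hidden_1, n_output),
        )

    def forward(self, x):
        x = self.layers(x)
        return x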

Training step:

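for step, (batch_anchor, batch_positive, batch_negative) in enumerate(train_loader):
    optimizer.zero_grad()  # clear gradients from the previous step
    anchor_out = model(batch_anchor)
    positive_out = model(batch_positive)
    negative_out = model(batch_negative)
    loss = loss_func(anchor_out, positive_out, negative_out)
    loss.backward()   # backpropagate the triplet loss
    optimizer.step()  # update the model parameters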

where the loss function and optimizer are defined as:

from torch import optim

optimizer = optim.Adam(model.parameters(), lr=0.002)
loss_func = torch.nn.TripletMarginLoss()  # defaults: margin=1.0, p=2 (L2 distance)
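
For reference, with its default arguments torch.nn.TripletMarginLoss averages max(d(a, p) - d(a, n) + margin, 0) over the batch, where d is the L2 distance and margin is 1.0. Below is a minimal NumPy sketch of that formula. The function name and the toy arrays are made up for illustration, and the small eps that PyTorch adds inside the distance is ignored.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Batch mean of max(d(a, p) - d(a, n) + margin, 0), with d the L2 distance."""
    d_ap = np.linalg.norm(anchor - positive, axis=1)  # anchor-positive distances
    d_an = np.linalg.norm(anchor - negative, axis=1)  # anchor-negative distances
    return float(np.mean(np.maximum(d_ap - d_an + margin, 0.0)))

# Two toy triplets in a 2-dimensional embedding space (hypothetical values).
anchor = np.array([[0.0, 0.0], [0.0, 0.0]])
positive = np.array([[3.0, 4.0], [0.0, 1.0]])  # distances to anchor: 5 and 1
negative = np.array([[0.0, 1.0], [3.0, 4.0]])  # distances to anchor: 1 and 5

# First triplet contributes 5 - 1 + 1 = 5, second max(1 - 5 + 1, 0) = 0.
loss = triplet_margin_loss(anchor, positive, negative)
print(loss)
```

A triplet contributes zero loss once the negative is at least one margin farther from the anchor than the positive.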

After training, I evaluate the network on the validation set:

with torch.no_grad():
    anchor_out_val = model(val_data_anchor).numpy()
    positive_out_val = model(val_data_positive).numpy()
    negative_out_val = model(val_data_negative).numpy()

I then use the L2 norm (the Euclidean distance between embeddings) to measure similarity and assign labels. This works well on the validation set, where I get 80% accuracy as measured by accuracy_score from sklearn. On the test set, however, I only get 50% accuracy. Can someone tell me why? Is the similarity metric unsuitable, or is the problem in the network?
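
The labelling step itself is not shown in the snippets. One way it could look is sketched below, under the assumption that a triplet gets label 1 when the anchor embedding is closer (in L2 distance) to the second image than to the third, and label 0 otherwise. The function names, the toy embeddings, and the labels are hypothetical, and the accuracy is computed with plain NumPy rather than accuracy_score.

```python
import numpy as np

def triplet_predictions(anchor, second, third):
    """Return 1 where the anchor is closer (L2) to `second` than to `third`, else 0."""
    d_second = np.linalg.norm(anchor - second, axis=1)
    d_third = np.linalg.norm(anchor - third, axis=1)
    return (d_second < d_third).astype(int)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Toy 2-dimensional embeddings standing in for the model outputs (hypothetical).
anchor = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
second = np.array([[1.0, 0.0], [5.0, 5.0], [1.0, 2.0]])
third = np.array([[4.0, 0.0], [1.0, 0.0], [1.0, 1.5]])

y_pred = triplet_predictions(anchor, second, third)
y_true = [1, 0, 1]  # hypothetical ground-truth labels
print(y_pred, accuracy(y_true, y_pred))
```

With a rule of this kind, an accuracy near 50% means the learned distances order the test triplets no better than chance.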

If I understand the issue correctly, you are seeing good training and validation accuracy, but the test accuracy is much worse.
Could your training procedure have leaked validation data into the training set, so that the validation accuracy is biased?
Also, are you processing the training and validation data differently from the test set? Were all dataset splits generated randomly from the same data pool, or were they sampled differently?
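
Both points can be checked with a few lines of plain Python. The sketch below assumes each sample has a unique integer identifier. It draws all three splits from one shuffled pool and then verifies that no identifier appears in more than one split. The helper names and the split sizes are hypothetical, so adapt them to however your data is actually indexed.

```python
import random

def split_indices(n_samples, n_val, n_test, seed=0):
    """Shuffle all indices once and cut them into disjoint train/val/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # one shuffle over the whole pool
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test

def overlap(split_a, split_b):
    """Return the identifiers that appear in both splits (should be empty)."""
    return set(split_a) & set(split_b)

train_idx, val_idx, test_idx = split_indices(n_samples=100, n_val=15, n_test=15)

# Any non-empty overlap means samples leaked from one split into another.
print(len(train_idx), len(val_idx), len(test_idx))
print(overlap(train_idx, val_idx), overlap(train_idx, test_idx), overlap(val_idx, test_idx))
```

For triplet data, it is worth running the same overlap check on the underlying image identifiers as well, since one image can appear in several triplets.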