What is the right way to feed image pairs to a network? I want to train a convolutional network to grade each pair (100 pairs in total). Should I use cat((img1, img2), 0), cat((img1, img2), 1) or cat((img1, img2), 2)? I tried 0, but I ended up with a 1x100x2x80x80 tensor (each image is 80x80) and got the error:

ValueError: Expected 4D tensor as input, got 5D tensor instead.

How would you define a layer that receives a 5D tensor?
I also tried 1x100x160x80 and 1x100x80x160, but the model didn’t learn anything, and I wonder whether this was the problem.

I’m not sure what your input sizes are, but you do have a couple of choices:

cat along channel dimension

cat along the H or W dimension

For example, if your images have size (300, 500) and are tensors of size (1, 3, 300, 500) (this is NCHW), you could do torch.cat([tensor1, tensor2], 1) to cat along the channel dimension, which gives a tensor of size (1, 6, 300, 500).

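You could also do torch.cat([tensor1, tensor2], 2) to create a new tensor of size (1, 3, 600, 500) that is literally the two images one above the other, or torch.cat([tensor1, tensor2], 3) to get size (1, 3, 300, 1000), the two images side by side. Note that torch.stack would not work here: it inserts a new dimension and returns a 5D tensor.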
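A quick shape check of these options may help. This is only a sketch with random dummy tensors; the last part assumes the 100 pairs from the question are single-channel (grayscale) 80x80 images, which the thread does not state.

```python
import torch

# Two dummy images in NCHW layout: batch 1, 3 channels, 300 x 500.
tensor1 = torch.randn(1, 3, 300, 500)
tensor2 = torch.randn(1, 3, 300, 500)

# torch.cat joins along an existing dimension, so the result stays 4D.
by_channel = torch.cat([tensor1, tensor2], dim=1)  # (1, 6, 300, 500)
by_height = torch.cat([tensor1, tensor2], dim=2)   # (1, 3, 600, 500)
by_width = torch.cat([tensor1, tensor2], dim=3)    # (1, 3, 300, 1000)

# torch.stack inserts a NEW dimension, which is what a Conv2d layer rejects.
stacked = torch.stack([tensor1, tensor2], dim=0)   # (2, 1, 3, 300, 500), 5D

# Assumption: 100 pairs of grayscale 80x80 images, one batch per side.
left = torch.randn(100, 1, 80, 80)
right = torch.randn(100, 1, 80, 80)
pairs = torch.cat([left, right], dim=1)            # (100, 2, 80, 80)

for name, t in [("by_channel", by_channel), ("by_height", by_height),
                ("by_width", by_width), ("stacked", stacked), ("pairs", pairs)]:
    print(name, tuple(t.shape))
```

With this layout the batch dimension is 100 (one entry per pair) and each pair is a 2-channel image, so the first convolution would need in_channels=2.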

I have 100 pairs of 80x80 images. I tried cat along dimensions 1 and 2, and unfortunately the network didn’t learn anything. When I tried the same thing with only 10 images it did learn, so I wonder whether the problem is the cat or something else.

Not sure what you are trying to achieve here. In my case, I create two batches, one for the left images, one for the right ones, concatenate along batch dimension, then merge pairs of embedding vectors:

batch: [im_1_l, im_2_l, im_3_l] , [im_1_r, im_2_r, im_3_r] (3 x C x H x W) x 2

concatenate: [im_1_l, im_2_l, im_3_l, im_1_r, im_2_r, im_3_r] (6 x C x H x W)

process each image with CNN: [emb_1_l, emb_2_l, emb_3_l, emb_1_r, emb_2_r, emb_3_r] (6 x E)

un-concatenate: [[emb_1_l, emb_1_r], [emb_2_l, emb_2_r], [emb_3_l, emb_3_r]] (3 x 2 x E)
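The steps above could be sketched roughly as follows. The PairScorer class, its tiny CNN, the embedding size, and the linear head that merges each pair are all hypothetical choices made for illustration; the post does not say which network or merge step was used.

```python
import torch
import torch.nn as nn


class PairScorer(nn.Module):
    """Hypothetical shared-weight CNN that embeds both sides of a pair."""

    def __init__(self, in_channels=1, emb_dim=32):
        super().__init__()
        # Placeholder CNN: any network mapping (N, C, H, W) -> (N, E) works.
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(8, emb_dim),
        )
        # Assumed merge step: concatenate the two embeddings, predict a score.
        self.head = nn.Linear(2 * emb_dim, 1)

    def forward(self, left, right):
        n = left.shape[0]
        both = torch.cat([left, right], dim=0)           # (2N, C, H, W)
        emb = self.cnn(both)                             # (2N, E)
        pairs = torch.stack([emb[:n], emb[n:]], dim=1)   # (N, 2, E)
        scores = self.head(pairs.flatten(1))             # (N, 1)
        return scores, pairs


model = PairScorer()
left = torch.randn(3, 1, 80, 80)    # im_1_l, im_2_l, im_3_l
right = torch.randn(3, 1, 80, 80)   # im_1_r, im_2_r, im_3_r
scores, pairs = model(left, right)
print(tuple(scores.shape), tuple(pairs.shape))
```

Because the left and right images go through the same CNN in one forward pass, both sides share weights, and pairs[i] holds the two embeddings of pair i.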