Alternatively, you could add the depth information as a fourth channel and edit the first convolutional layer of resnet18 so that it takes 4 input channels instead of 3.
To borrow from @ptrblck’s nice example:
import torch
import torch.nn as nn
from torchvision.models import resnet18

x_image = torch.randn(1, 3, 224, 224)  # Variable is deprecated; plain tensors work
x_depth = torch.randn(1, 1, 224, 224)
input = torch.cat((x_image, x_depth), dim=1)  # RGBD input; cat takes a tuple of tensors
model = resnet18()
model.conv1 = nn.Conv2d(4, 64, kernel_size=7, stride=2, padding=3, bias=False)
output = model(input)
I’m not sure which would work better for your purposes, but this approach uses far fewer parameters than the siamese method.