Training siamese and triplet networks: stacking vs. multi-pass

I am starting to explore siamese and triplet networks, where the same model appears in the loss at least twice. Think of a siamese network and a data loader that produces two images. Each image is passed through the same network, and the loss is, e.g., the Euclidean distance between the embeddings. Initially, I thought that if I have two images I could stack them (to form a larger batch), pass the batch through a single model, and recover the separate outputs at the end. Now I am not sure whether backprop would work correctly, since I do not see how my setup knows that there are two models (the gradient is definitely different).

My question is: for siamese and triplet networks, do I need to treat the pair/triplet images separately and pass them through the model 2/3 times, or will stacking actually work?

Second part: if stacking does not work, what would I have to modify so that the equations match? (It is desirable to pass a single batch and then form, e.g., triplets after the pass, as in online semi-hard triplet mining.) The TensorFlow example does not seem to care that there are 3 networks in the triplet problem.

PyTorch keeps track of activations, so you can reuse the same network for different inputs and/or losses and things work as expected. For instance, the official GAN example does this.
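To make this concrete, here is a minimal sketch comparing the two variants from the question: two separate passes through one shared network versus one pass over the stacked batch. The network `net`, the input shapes, and the squared-Euclidean loss are made up for illustration and are not taken from any official example. The sketch assumes the model has no layers whose output depends on the batch composition.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical tiny embedding network (illustration only).
# No BatchNorm/Dropout, so each sample's output is independent of the rest of the batch.
net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

# Stand-ins for the two "images" of each pair: 5 pairs, 8 features each.
x1 = torch.randn(5, 8)
x2 = torch.randn(5, 8)


def pair_loss(e1, e2):
    # Mean squared Euclidean distance between paired embeddings.
    return (e1 - e2).pow(2).sum(dim=1).mean()


def current_grads():
    # Copy the gradients so a later backward pass cannot overwrite them.
    return [p.grad.clone() for p in net.parameters()]


# Variant A: pass each input through the same network separately.
net.zero_grad()
loss_a = pair_loss(net(x1), net(x2))
loss_a.backward()
grads_two_pass = current_grads()

# Variant B: stack into one larger batch, run once, then split the outputs.
net.zero_grad()
out = net(torch.cat([x1, x2], dim=0))
e1, e2 = out.chunk(2, dim=0)
loss_b = pair_loss(e1, e2)
loss_b.backward()
grads_stacked = current_grads()

# Both variants should agree up to floating-point noise.
same = all(
    torch.allclose(a, b, rtol=1e-4, atol=1e-5)
    for a, b in zip(grads_two_pass, grads_stacked)
)
print("losses match:", torch.allclose(loss_a, loss_b, rtol=1e-4, atol=1e-5))
print("gradients match:", same)
```

One caveat: with layers such as BatchNorm in training mode, the stacked batch and the separate batches see different batch statistics, so the two variants are no longer numerically identical.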

One thing to keep in mind is that layer.weight.grad accumulates the gradients from every backward pass until you zero it, but this is normally what you want. You can use hooks to print what autograd sees during interleaved forward/backward passes; see the example colab.
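A minimal sketch of both points, using a made-up `nn.Linear` layer rather than the code from the colab: a hook registered with `Tensor.register_hook` records the gradient produced by each backward pass, while `layer.weight.grad` holds the running sum until it is reset.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical single layer and input, for illustration only.
layer = nn.Linear(3, 1)
x = torch.randn(4, 3)

# The hook receives the gradient computed in each backward pass,
# before it is accumulated into layer.weight.grad.
seen = []
handle = layer.weight.register_hook(lambda g: seen.append(g.clone()))

# First backward pass: .grad holds the gradient of this pass.
layer(x).sum().backward()
first = layer.weight.grad.clone()

# Second backward pass without zeroing: .grad is now the sum of both passes.
layer(x).sum().backward()
second = layer.weight.grad.clone()

# After resetting, .grad again holds a single pass worth of gradient.
layer.zero_grad()
layer(x).sum().backward()
after_reset = layer.weight.grad.clone()

handle.remove()  # Detach the hook once it is no longer needed.

print("accumulated == 2 * single:", torch.allclose(second, 2 * first))
print("hook calls:", len(seen))
```

In a typical training loop, calling `optimizer.zero_grad()` (or `model.zero_grad()`) once per step is what keeps this accumulation from leaking across steps.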