Hello.
I am trying to understand a weird phenomenon and I really hope you can help me, as it seems to underpin something quite fundamental that I am missing here.
I have a batch of images. The first k images are passed through an embedding network; the other k*n are passed through the same network and then through an RNN, which outputs k hidden states (I believe the details are not important). I then feed the k hidden states and the k embeddings from the first group into a relation network which compares them.
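To make the data flow concrete, here is a rough sketch of the setup. Only image_embedding is a name from my actual model; the RNN, relation head, sizes and image shape are placeholders, and the forward pass below uses the ordering from the "working" snippet further down:

import torch
import torch.nn as nn

class Model(nn.Module):
    # Rough sketch, not my real code: the layer sizes, GRU and relation head
    # are just placeholders to show how the pieces fit together.
    def __init__(self, emb_dim=64, hidden_dim=64):
        super().__init__()
        self.image_embedding = nn.Sequential(
            nn.Flatten(), nn.Linear(28 * 28, emb_dim), nn.ReLU()
        )
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.relation = nn.Sequential(
            nn.Linear(emb_dim + hidden_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, x, k, n):
        emb_all = self.image_embedding(x)      # (k + k*n, emb_dim)
        emb_A = emb_all[:k]                    # embeddings of the first group
        emb_B = emb_all[k:].view(k, n, -1)     # k sequences of n embeddings
        _, h = self.rnn(emb_B)                 # h: (1, k, hidden_dim)
        pair = torch.cat([emb_A, h.squeeze(0)], dim=1)
        return self.relation(pair)             # (k, 1) relation scores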
All else being equal, I noticed that the point at which I split the batch relative to the embedding step matters: in one case the gradient rapidly goes to zero and the network doesn’t learn anything; in the other it learns quickly.
x is the batch of images.
This is the one that doesn’t work.
A = x[:k]                          # first k images
B = x[k:]                          # remaining k*n images
emb_A = model.image_embedding(A)   # embed each group separately
emb_B = model.image_embedding(B)
...
I have also tried using two different networks (with the same architecture) for emb_A and emb_B, with the same unsatisfying results.
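Concretely, that variant looked roughly like this (model_A and model_B are placeholder names for two separately initialized copies of the embedding network):

emb_A = model_A.image_embedding(A)   # one network for the first k images
emb_B = model_B.image_embedding(B)   # a second, identically-shaped network for the rest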
This is the one that works:
emb_all = model.image_embedding(x)   # embed the whole batch at once
emb_A = emb_all[:k]                  # then slice into the two groups
emb_B = emb_all[k:]
...
I don’t understand why it should matter whether I separate the images before or after doing the embedding. I can see how reusing the same network on the two groups might make the gradients harder to converge, but as I said the same thing happens when two networks are used, which makes me believe it’s due to the slicing operation.
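For reference, this is roughly how I am watching the gradients after loss.backward() (just a quick check, nothing rigorous):

for name, p in model.named_parameters():
    if p.grad is not None:
        print(name, p.grad.norm().item())   # in the failing setup these norms rapidly shrink toward zero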
Does anyone have any idea?
Thanks