Hi,
I recently found a bizarre phenomenon in my model. Here is the setup:
I chained two models, M1 and M2, where M2 takes the output of M1 as its input.
However, after calling loss.backward(), I found that the gradient norm in M2 is really small, while this does not happen in M1.
One thing is special about this model: M1 and M2 are allocated on two different GPUs. The forward pass looks like this:
Output_M1 = M1(input.to(GPU_M1))
Output_M2 = M2(Output_M1.to(GPU_M2))
I have also tested forwarding like this:
output = M2(input.to(GPU_M1).to(GPU_M2))
In that case the gradient in M2 is also small. However, if I allocate the input directly on M2's GPU, this does not happen.
Given this, I am wondering what possible mistakes I might have made in my model?
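For reference, here is a minimal repro sketch of the setup I described (module names and sizes are made up for illustration; it falls back to CPU when two GPUs are not available, since Tensor.to() participates in autograd the same way either way):

```python
import torch
import torch.nn as nn

# Hypothetical minimal repro: two small modules chained with a .to()
# transfer inside the graph, as in my forward function above.
has_two_gpus = torch.cuda.device_count() >= 2
dev1 = torch.device("cuda:0" if has_two_gpus else "cpu")
dev2 = torch.device("cuda:1" if has_two_gpus else "cpu")

torch.manual_seed(0)
M1 = nn.Linear(8, 8).to(dev1)
M2 = nn.Linear(8, 8).to(dev2)

x = torch.randn(4, 8, device=dev1)
out1 = M1(x)
out2 = M2(out1.to(dev2))  # cross-device transfer inside the graph
out2.sum().backward()     # stand-in for my real loss

# Compare overall gradient norms of the two modules.
norm1 = torch.norm(torch.stack([p.grad.norm() for p in M1.parameters()]))
norm2 = torch.norm(torch.stack([p.grad.norm() for p in M2.parameters()]))
print(f"grad norm M1: {norm1:.4f}, grad norm M2: {norm2:.4f}")
```

In my real model, norm2 comes out much smaller than norm1 whenever the input goes through a .to() transfer before reaching M2.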