Vanishing gradients in a chained model

Hi,

I recently ran into a strange phenomenon in my model. Here is the setup:

I chained two models, M1 and M2, where M2 takes the output of M1 as its input.
However, after calling loss.backward(), I found that the gradient norm in M2 is very small, while M1 does not show this behavior.

There is one special thing about this model: M1 and M2 are placed on two different GPUs. The forward pass looks like this:

Output_M1 = M1(input.to(GPU_M1))
Output_M2 = M2(Output_M1.to(GPU_M2))
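
For reference, here is a simplified, self-contained sketch of the structure and of how I check the gradient norms (the nn.Linear layers, tensor sizes, and device indices are only placeholders for illustration; my real models are more complex):

import torch
import torch.nn as nn

GPU_M1 = torch.device("cuda:0")
GPU_M2 = torch.device("cuda:1")

# placeholder sub-models, one per GPU
M1 = nn.Linear(16, 16).to(GPU_M1)
M2 = nn.Linear(16, 1).to(GPU_M2)

x = torch.randn(8, 16)

# chained forward across the two devices; as far as I understand,
# .to() is differentiable, so gradients should flow back from M2 into M1
Output_M1 = M1(x.to(GPU_M1))
Output_M2 = M2(Output_M1.to(GPU_M2))

loss = Output_M2.pow(2).mean()
loss.backward()

# compare the total gradient norm of each sub-model
for name, model in [("M1", M1), ("M2", M2)]:
    grad_norm = torch.norm(torch.stack([p.grad.detach().norm() for p in model.parameters()]))
    print(name, grad_norm.item())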

I have also tested the following forward pass:

output = M2(input.to(GPU_M1).to(GPU_M2))

Here the gradient in M2 is also small. However, if I allocate the input directly on M2's GPU, the problem does not occur.
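
This is the comparison I mean, again with a placeholder input tensor (in my real code the input comes from a data loader). Since .to() only copies the data, I would expect the two paths to behave identically:

x = torch.randn(8, 16)

# route the input through GPU_M1 first, then copy it to GPU_M2
out_via_gpu1 = M2(x.to(GPU_M1).to(GPU_M2))

# allocate the input directly on GPU_M2
out_direct = M2(x.to(GPU_M2))

# I would expect the two outputs (and the resulting M2 gradients) to match
print(torch.allclose(out_via_gpu1, out_direct))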

Given this, I am wondering whether there is a mistake I might have made in my model or in the way I move tensors between GPUs?