0.4.0 arguments are located on different GPUs

Hello Everyone,

I have a model that I want to train on multiple GPUs. It works well on 0.3.0, but after I switched to 0.4.0 it fails with the error message "arguments are located on different GPUs". Stranger still, the first iteration runs fine (the code gets through all the backward() and optimizer.step() calls) and the error only appears on the second call to backward().

The code is quite complicated and also cannot be published yet. Does anyone have an idea how this error could happen, based on this limited information?

Thanks in advance

This sounds like a bug. It’s a little hard to help without seeing code, but here are a few things you can try:

  1. Trim down your example. Try deleting parts of your model until you arrive at a more minimal example that still triggers the error.
  2. Figure out which function actually throws the error. The check that all arguments are on the same GPU is performed in the backend. If you can find which function it's in, and which arguments aren't on the same GPU, you may be able to track down where in your code this happens.
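As a concrete starting point for step 2, a small helper like the one below (hypothetical, not from the original code) can list which device every parameter and buffer currently lives on, so a tensor that was accidentally left on a different GPU is easy to spot:

```python
import torch.nn as nn

def report_devices(model):
    """List the device of every parameter and buffer in `model`.

    state_dict() covers both parameters and buffers, so a stray
    tensor that ended up on a different GPU will show up here.
    """
    devices = {}
    for name, tensor in model.state_dict().items():
        devices[name] = tensor.device
        print(f"{name}: {tensor.device}")
    return devices
```

Calling this just before the failing backward() on each iteration can also reveal whether something moves between the first and second iteration; if the printed devices are not all identical, the mismatched entry usually points at the module to inspect.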