Has anyone run into "RuntimeError: all tensors must be on device" errors inside a forward pass?
For example, I have this in my training loop:
mixed, clean = batch, batch
znoise = znoise.resize_(batch_size, 1024, 8).normal_(0., 1).float()
noise_clean = pnoise.resize_(clean.size()).normal_(0, 1) * input_noise_std
mixed = to_gpu(mixed, inference=False, fp16=fp16)
clean = to_gpu(clean, inference=False, fp16=fp16)
znoise = to_gpu(znoise, inference=False, fp16=fp16)
pnoise = to_gpu(pnoise, inference=False, fp16=fp16)
print("mixed", mixed.get_device()) # outputs 0
print("znoise", znoise.get_device()) # outputs 0
I then call model(mixed, znoise), where forward is:
def forward(self, wav, znoise):
    print(wav.get_device(), znoise.get_device(), cuda.current_device())  # outputs 0, 0, 0 then 1, 1, 1
    # encoder-decoder ladder network with wav and znoise
Executing model(mixed, znoise) leads to RuntimeError: all tensors must be on device
What is your model and how are you running the training loop? It’s a little hard to tell from what you’ve provided where the tensors are being moved to another device.
The model is a Generator like DCGAN with a ladder network structure, Convs with PReLU and “Deconvs” with skip connections and PReLU.
The model is instantiated with DataParallel, then assigned to the GPU with cuda().
Data batches are loaded with a DataLoader, and the training loop is executed at each iteration.
The forward pass executes with no problem until the return, where it outputs the error message.
I can be more explicit in code if need be.
The error happened because I was calling DataParallel on the model twice!
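For anyone hitting the same thing, here is a minimal sketch of what the double wrap looks like and one way to guard against it. The helper names wrap_once and count_data_parallel are hypothetical (not part of PyTorch), and nn.Linear is only a stand-in for the real generator:

```python
import torch.nn as nn


def wrap_once(model: nn.Module) -> nn.Module:
    """Hypothetical helper: wrap in DataParallel only if not already wrapped."""
    if isinstance(model, nn.DataParallel):
        return model
    return nn.DataParallel(model)


def count_data_parallel(model: nn.Module) -> int:
    """Count DataParallel wrappers in the module tree, including the root."""
    return sum(isinstance(m, nn.DataParallel) for m in model.modules())


net = nn.Linear(4, 2)  # stand-in for the real generator (assumption)

# The mistake: DataParallel applied twice, e.g. once in a model factory
# and once more in the training script.
double = nn.DataParallel(nn.DataParallel(net))

# The guard makes repeated wrapping a no-op.
single = wrap_once(wrap_once(net))

print(count_data_parallel(double), count_data_parallel(single))
```

Checking isinstance at the root only catches a wrapper at the top level; counting over model.modules() also finds a DataParallel buried inside a submodule.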
Interesting. Glad you figured it out!
Had the same problem, thank you for figuring it out.
The error message could be a lot clearer.
RuntimeError: all tensors must be on device
should be something like:
RuntimeError: Model already contains a DataParallel submodule
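Until something like that exists upstream, a check along these lines can be run in user code before wrapping. This is only a sketch: checked_data_parallel is a hypothetical helper, not a PyTorch API, and the message text is the one proposed above:

```python
import torch.nn as nn


def checked_data_parallel(model: nn.Module) -> nn.DataParallel:
    """Hypothetical helper: wrap in DataParallel, failing loudly on a nested wrap."""
    # named_modules() yields the root module first (with an empty name),
    # followed by every submodule.
    for name, sub in model.named_modules():
        if isinstance(sub, nn.DataParallel):
            where = name or "<root>"
            raise RuntimeError(
                f"Model already contains a DataParallel submodule (at {where})"
            )
    return nn.DataParallel(model)


wrapped = checked_data_parallel(nn.Linear(4, 2))  # first wrap succeeds

try:
    checked_data_parallel(wrapped)  # second wrap is rejected
except RuntimeError as err:
    print(err)
```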
After hours of debugging and attempting every possible solution out there, it turned out I was also calling DataParallel twice.