Forward changes data's device allocation (RuntimeError: tensors must be on device[0])

Has anyone had RuntimeError: all tensors must be on device[0] errors inside a forward pass?

For example, I have this in my training loop:

mixed, clean = batch[0], batch[1]
# znoise and pnoise are pre-allocated noise buffers reused each iteration
znoise = znoise.resize_(batch_size, 1024, 8).normal_(0., 1).float()
noise_clean = pnoise.resize_(clean.size()).normal_(0, 1) * input_noise_std
mixed = to_gpu(mixed, inference=False, fp16=fp16)
clean = to_gpu(clean, inference=False, fp16=fp16)
znoise = to_gpu(znoise, inference=False, fp16=fp16)
pnoise = to_gpu(pnoise, inference=False, fp16=fp16)
print("mixed", mixed.get_device()) # outputs 0
print("znoise", znoise.get_device()) # outputs 0
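For context, the `to_gpu` helper probably looks something like the sketch below (hypothetical; the thread never shows the real implementation, so the `inference` and `fp16` handling here is an assumption):

```python
import torch

def to_gpu(x, inference=False, fp16=False):
    # Hypothetical sketch of the helper used above: optionally cast to
    # half precision, move to the current CUDA device, and detach the
    # tensor when running inference. The actual implementation is not
    # shown in the thread.
    if fp16:
        x = x.half()
    if torch.cuda.is_available():
        x = x.cuda()
    if inference:
        x = x.detach()
    return x

t = to_gpu(torch.zeros(2, 3))
print(t.shape)  # torch.Size([2, 3])
```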

Then I call model(mixed, znoise), where forward is:

def forward(self, wav, znoise):
    print(wav.get_device(), znoise.get_device(), cuda.current_device()) # outputs 0, 0, 0 then 1,1,1
    # encoder-decoder ladder network with wav and znoise
    # ...
    return output_dec

Executing model(mixed, znoise) leads to RuntimeError: all tensors must be on device[0]


What is your model and how are you running the training loop? It’s a little hard to tell from what you’ve provided where the tensors are being moved to another device.

The model is a DCGAN-like Generator with a ladder-network structure: Convs with PReLU, and “Deconvs” with skip connections and PReLU.
The model is wrapped in DataParallel and then moved to the GPU with cuda().
Batches are loaded with a DataLoader, and the training loop runs once per iteration.
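In outline, that setup looks like the sketch below (the layers and dataset shapes are placeholders, not the actual generator):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder standing in for the DCGAN-style ladder-network generator.
model = nn.Sequential(nn.Conv1d(1, 8, 3, padding=1), nn.PReLU())
model = nn.DataParallel(model)      # wrap once for multi-GPU
if torch.cuda.is_available():
    model = model.cuda()            # then move to the GPUs

# Toy dataset of (mixed, clean) waveform pairs.
dataset = TensorDataset(torch.randn(16, 1, 64), torch.randn(16, 1, 64))
loader = DataLoader(dataset, batch_size=4)

for batch in loader:
    mixed, clean = batch[0], batch[1]
    if torch.cuda.is_available():
        mixed, clean = mixed.cuda(), clean.cuda()
    output = model(mixed)           # DataParallel scatters the batch across GPUs
```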

The forward pass executes with no problem until the return, where it outputs the error message.
I can be more explicit in code if need be.

The error happened because I was calling DataParallel on the model twice!


Interesting. Glad you figured it out!

Had the same problem, thank you for figuring it out.
The error message could be a lot clearer.

RuntimeError: all tensors must be on device[0]

should be something like:

RuntimeError: Model already contains a DataParallel submodule

After hours of debugging and attempting every possible solution out there, I was also calling DataParallel twice.
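Until the error message improves, a small guard can make the double wrap harmless (`wrap_once` is a hypothetical helper, not part of PyTorch or the thread):

```python
import torch.nn as nn

def wrap_once(model):
    """Wrap a model in DataParallel only if it isn't wrapped already.
    Hypothetical helper to defend against accidental double wrapping."""
    if isinstance(model, nn.DataParallel):
        return model
    return nn.DataParallel(model)

m = nn.Linear(4, 4)
m = wrap_once(m)
m = wrap_once(m)  # second call is a no-op instead of nesting a second wrapper
print(type(m.module).__name__)  # Linear
```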
Thanks :slight_smile: