DataParallel, training with multiple GPUs

Hi! So I do this:

net = torch.nn.DataParallel(AutoEncoder(config), device_ids = list_of_gpus)

But then when I print: for elem in net.parameters(): print(elem.device)

I get “cpu” for most of the parameters, and at the end I get cuda:0. Why is this the case? Thanks

Could you post your model definition, so that we could have a look at it, please?


I’m using this open-source code:

Except I don’t want to use just one gpu, so I modified up until line 78 in the file linked above to be:

config = setup(default_config="…/configs/violin.yaml")




list_of_gpus = [int(s) for s in config.gpu_list.split(',')]


net = AutoEncoder(config)

net = torch.nn.DataParallel(net, device_ids = list_of_gpus)

Thanks for the link. Which parameters are still on the CPU after pushing the model to the GPU?

What’s the best way for me to figure that out?

You can print them via:

for name, param in model.named_parameters():
    print(name, param.device)
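
Note that named_parameters() only covers registered parameters. As a minimal sketch (the helper report_tensor_devices and the Toy module are hypothetical names used for illustration, not part of the linked repository), the check could be widened to registered buffers and to plain tensor attributes, which are easy to miss:

```python
import torch
import torch.nn as nn


def report_tensor_devices(model: nn.Module):
    """Return (kind, name, device) rows for every tensor found on the model."""
    rows = []
    for name, param in model.named_parameters():
        rows.append(("parameter", name, str(param.device)))
    for name, buf in model.named_buffers():
        rows.append(("buffer", name, str(buf.device)))
    # Plain tensor attributes are invisible to named_parameters()/named_buffers(),
    # so look for them directly in each submodule's instance dictionary.
    for mod_name, module in model.named_modules():
        for attr, value in vars(module).items():
            if isinstance(value, torch.Tensor):
                full = f"{mod_name}.{attr}" if mod_name else attr
                rows.append(("plain attribute", full, str(value.device)))
    return rows


class Toy(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)
        self.register_buffer("window", torch.hann_window(4))
        self.scale = torch.ones(4)  # plain attribute: not registered anywhere

    def forward(self, x):
        return self.linear(x * self.window * self.scale)


for kind, name, device in report_tensor_devices(Toy()):
    print(kind, name, device)
```

Any tensor reported as a "plain attribute" is a candidate for a device mismatch, since it is not handled the way parameters and buffers are.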

It seems like everything is on cuda:0, but the error says something is on cuda:1 :

Printed from code above:

module.decoder.mlp_f0.mlp_layer1.0.weight cuda:0
module.decoder.mlp_f0.mlp_layer1.0.bias cuda:0
module.decoder.mlp_f0.mlp_layer1.1.weight cuda:0
module.decoder.mlp_f0.mlp_layer1.1.bias cuda:0
module.decoder.mlp_f0.mlp_layer2.0.weight cuda:0
module.decoder.mlp_f0.mlp_layer2.0.bias cuda:0
module.decoder.mlp_f0.mlp_layer2.1.weight cuda:0
module.decoder.mlp_f0.mlp_layer2.1.bias cuda:0
module.decoder.mlp_f0.mlp_layer3.0.weight cuda:0
module.decoder.mlp_f0.mlp_layer3.0.bias cuda:0
module.decoder.mlp_f0.mlp_layer3.1.weight cuda:0
module.decoder.mlp_f0.mlp_layer3.1.bias cuda:0
module.decoder.mlp_loudness.mlp_layer1.0.weight cuda:0
module.decoder.mlp_loudness.mlp_layer1.0.bias cuda:0
module.decoder.mlp_loudness.mlp_layer1.1.weight cuda:0
module.decoder.mlp_loudness.mlp_layer1.1.bias cuda:0
module.decoder.mlp_loudness.mlp_layer2.0.weight cuda:0
module.decoder.mlp_loudness.mlp_layer2.0.bias cuda:0
module.decoder.mlp_loudness.mlp_layer2.1.weight cuda:0
module.decoder.mlp_loudness.mlp_layer2.1.bias cuda:0
module.decoder.mlp_loudness.mlp_layer3.0.weight cuda:0
module.decoder.mlp_loudness.mlp_layer3.0.bias cuda:0
module.decoder.mlp_loudness.mlp_layer3.1.weight cuda:0
module.decoder.mlp_loudness.mlp_layer3.1.bias cuda:0
module.decoder.gru.weight_ih_l0 cuda:0
module.decoder.gru.weight_hh_l0 cuda:0
module.decoder.gru.bias_ih_l0 cuda:0
module.decoder.gru.bias_hh_l0 cuda:0
module.decoder.mlp_gru.mlp_layer1.0.weight cuda:0
module.decoder.mlp_gru.mlp_layer1.0.bias cuda:0
module.decoder.mlp_gru.mlp_layer1.1.weight cuda:0
module.decoder.mlp_gru.mlp_layer1.1.bias cuda:0
module.decoder.mlp_gru.mlp_layer2.0.weight cuda:0
module.decoder.mlp_gru.mlp_layer2.0.bias cuda:0
module.decoder.mlp_gru.mlp_layer2.1.weight cuda:0
module.decoder.mlp_gru.mlp_layer2.1.bias cuda:0
module.decoder.mlp_gru.mlp_layer3.0.weight cuda:0
module.decoder.mlp_gru.mlp_layer3.0.bias cuda:0
module.decoder.mlp_gru.mlp_layer3.1.weight cuda:0
module.decoder.mlp_gru.mlp_layer3.1.bias cuda:0
module.decoder.dense_harmonic.weight cuda:0
module.decoder.dense_harmonic.bias cuda:0
module.decoder.dense_filter.weight cuda:0
module.decoder.dense_filter.bias cuda:0
module.reverb.fir cuda:0
module.reverb.drywet cuda:0
module.reverb.decay cuda:0


 File "/ddsp-pytorch/train/trainer/", line 246, in train
    phase, num_steps=num_steps
  File "/ddsp-pytorch/train/trainer/", line 377, in _train
    results = self._step(phase=phase, iterator=step_iterator)
  File "/ddsp-pytorch/train/trainer/", line 336, in _step
    outputs =
  File ".local/share/virtualenvs/ddsp-pytorch-BwyVRuzu/lib/python3.7/site-packages/torch/nn/modules/", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/.local/share/virtualenvs/ddsp-pytorch-BwyVRuzu/lib/python3.7/site-packages/torch/nn/parallel/", line 161, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/.local/share/virtualenvs/ddsp-pytorch-BwyVRuzu/lib/python3.7/site-packages/torch/nn/parallel/", line 171, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/.local/share/virtualenvs/ddsp-pytorch-BwyVRuzu/lib/python3.7/site-packages/torch/nn/parallel/", line 86, in parallel_apply
  File "/.local/share/virtualenvs/ddsp-pytorch-BwyVRuzu/lib/python3.7/site-packages/torch/", line 428, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/.local/share/virtualenvs/ddsp-pytorch-BwyVRuzu/lib/python3.7/site-packages/torch/nn/parallel/", line 61, in _worker
    output = module(*input, **kwargs)
  File "/.local/share/virtualenvs/ddsp-pytorch-BwyVRuzu/lib/python3.7/site-packages/torch/nn/modules/", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/ddsp-pytorch/train/network/autoencoder/", line 64, in forward
    batch = self.encoder(batch)
  File /.local/share/virtualenvs/ddsp-pytorch-BwyVRuzu/lib/python3.7/site-packages/torch/nn/modules/", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "ddsp-pytorch/train/network/autoencoder/", line 101, in forward
    batch["loudness"] = self.loudness_extractor(batch)
  File "/.local/share/virtualenvs/ddsp-pytorch-BwyVRuzu/lib/python3.7/site-packages/torch/nn/modules/", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/ddsp-pytorch/train/../components/", line 79, in forward
    sliced_windowed_signal = sliced_signal * self.smoothing_window
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

Are you using any .cuda() or to('cuda') operations inside the model’s forward method?
If so, you would have to remove them and use e.g. the .device attribute of an internal parameter or the input tensor.
nn.DataParallel will create model replicas on all devices, which would yield a device mismatch if you create tensors in the forward on the default device.
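
A minimal sketch of the pattern described above, assuming the offending tensor is a fixed window that is multiplied with the input. WindowedFrames is a hypothetical stand-in, not the repository's module. It registers the window as a buffer, so it is replicated together with the parameters, and it creates new tensors on the input's device:

```python
import torch
import torch.nn as nn


class WindowedFrames(nn.Module):  # hypothetical stand-in, not the repository's module
    def __init__(self, n_fft: int = 8):
        super().__init__()
        # Registered buffers are copied to each replica's device by nn.DataParallel;
        # a plain tensor attribute would stay on the device where it was created.
        self.register_buffer("smoothing_window", torch.hann_window(n_fft))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # Tensors created inside forward follow the input's device
        # instead of a hard-coded "cuda" or "cuda:0".
        offset = torch.zeros(1, device=frames.device, dtype=frames.dtype)
        return frames * self.smoothing_window + offset


module = WindowedFrames()
frames = torch.randn(2, 3, 8)
out = module(frames)
print(out.shape, out.device)
```

The same module runs unchanged on CPU, on a single GPU, or wrapped in nn.DataParallel, because nothing in it names a specific device.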

I don’t see any .cuda() or .to('cuda') operations anywhere in the source code… Do you? I do see this, though:

def get_model_device(net):
    device = torch.device("cuda:0")
    # for param in net.parameters():
    #     device = param.device
    #     break
    return device

I haven’t checked the complete repository, but I guess that the returned device might be used somewhere it shouldn’t be. Could you check where this method is called and whether the device is used to push some activations or parameters to cuda:0?
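
For comparison, here is a sketch of a helper that reads the device from the model instead of hard-coding cuda:0, which is what the commented-out lines in the function above appear to have done. The CPU fallback for a model without parameters is an assumption added here for illustration:

```python
import torch
import torch.nn as nn


def get_model_device(net: nn.Module) -> torch.device:
    """Return the device of the first parameter, or CPU if there are none."""
    try:
        return next(net.parameters()).device
    except StopIteration:
        # Assumption: treat a parameter-free model as living on the CPU.
        return torch.device("cpu")


print(get_model_device(nn.Linear(2, 2)))
```

A helper like this is suited to moving input batches outside the model; inside a forward method, the device of the incoming tensor is the safer reference.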

So on line 163 they do that mapping to self.device, and self.device is initialized earlier to the output of the get_model_device function above. Firstly, I’m not sure why the model being set to cuda:0 is a problem, since if I remember correctly the model actually stays on the default device, no? Even if this is wrong, I tried removing the map_location argument and got the same error…

However! I bet lines 316-327 in the same link above really matter. They send the batch to the device, and I’m not sure what I should replace self.device with.

Sending the nn.DataParallel model and the data to the default device is the right approach.
If you haven’t found any device calls inside the model (e.g. in the forward method), could you try to narrow down the code and post an executable code snippet, so that we could debug it?
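
As a rough sketch of that setup (the small nn.Sequential model is a placeholder, not the AutoEncoder from the repository): nn.DataParallel wraps the module but does not move it, so the module is pushed to the first device in device_ids before wrapping, and each batch goes to that same device. The CPU branch is there only so the snippet runs on a machine without GPUs.

```python
import torch
import torch.nn as nn

# Placeholder model, standing in for the real network.
net = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 1))

if torch.cuda.is_available():
    device_ids = list(range(torch.cuda.device_count()))
    device = torch.device(f"cuda:{device_ids[0]}")
    # Parameters and buffers must live on device_ids[0] before the forward pass.
    net = net.to(device)
    net = nn.DataParallel(net, device_ids=device_ids)
else:
    device = torch.device("cpu")

# The full batch goes to the default device; DataParallel scatters it
# across the replicas and gathers the outputs back on device_ids[0].
batch = torch.randn(32, 16).to(device)
outputs = net(batch)
print(outputs.shape, outputs.device)
```

With this arrangement the training loop never needs to mention cuda:1 or any other replica device.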

I’m pretty sure the _step() function is the forward() function in this case, though, and they do send the input batch to the default device. You say that’s okay, but since _step() is forward(), I’m not quite sure what you mean.

In the source code I linked, the actual net gets sent to cuda in the file, but as I mentioned before, I removed that and replaced it with:

net = AutoEncoder(config)
net = torch.nn.DataParallel(net, device_ids = list_of_gpus)

The error first occurs in “”, which contains the first part of the encoder, so I set a pdb.set_trace() in there:


def forward(self, z):
    """
    Compute A-weighted Loudness Extraction
    z['audio'] : batch of time-domain signals
    output_signal : batch of reverberated signals
    """
    input_signal = z['audio']
    paded_input_signal = nn.functional.pad(input_signal, (self.frame_length * 2, self.frame_length * 2))
    sliced_signal = paded_input_signal.unfold(1, self.n_fft, self.frame_length)
    sliced_windowed_signal = sliced_signal * self.smoothing_window

I set the pdb.set_trace() and checked self.smoothing_window.device, which gave device(type='cuda', index=0). I also ran sliced_signal.get_device() and got 0.

Hi! Just wondering if you have any next steps I can take?

If you get stuck, you could still try to remove all unnecessary code and post a minimal code snippet which would reproduce this issue so that we could debug it.
