Bug in DataParallel? Only works if the dataset device is cuda:0

Hi all,

when I am using DataParallel, I notice that it only works when the data is on the first available cuda device.

Say I am doing something like

    for batch_idx, (features, targets) in enumerate(train_loader):
        
        features = features.to(device)
        targets = targets.to(device)

If device is “cuda:1” it throws an error RuntimeError: all tensors must be on devices[0]. However, if I change the device to “cuda:0” the parallel training on multiple GPUs works just fine.

Is that on purpose or a bug? If this is a current limitation, I feel like it should be documented at https://pytorch.org/docs/stable/nn.html?highlight=dataparallel#torch.nn.parallel.DistributedDataParallel

Happy to send a PR, just wanted to know if I am understanding this correctly …

2 Likes

Hi

How are you setting your cuda device ? What Gpus are you using ?

In general to distribute equally to all your GPUs you need to pass “cuda”. By passing “cuda:0” it will only use the GPU 0 not all of them.

On the past I had a GTX 690 (DUAL CORE GPU already in SLI) that only was controlled by the GPU:0 even NVIDIA developers told me that was a bug on the driver. But never fixed.

Could you post the code part where your model is wrapped into DataParallel?
As far as I know, you have to push the data onto the “default” device, i.e. the GPU where the scatter and gather will take place. If there is a mismatch, this error is thrown.

1 Like

In general to distribute equally to all your GPUs you need to pass “cuda”. By passing “cuda:0” it will only use the GPU 0 not all of them.

of course. this was just for the dataset. Let be paste a bit more code for the context

Could you post the code part where your model is wrapped into DataParallel ?


device = torch.device("cuda:0")

...

model = VGG16(num_features=num_features, num_classes=num_classes)
if torch.cuda.device_count() > 1:
    print("Using", torch.cuda.device_count(), "GPUs")
    model = nn.DataParallel(model)

model.to(device)

...

for epoch in range(num_epochs):
    
    model.train()
    for batch_idx, (features, targets) in enumerate(train_loader):
        
        features = features.to(device)
        targets = targets.to(device)

In this case, it’s 4 GPUs, and the whole things seems to work; I get even 3x speed-up compared to the 1 GPU version without DataParallel. It’s also utilizing the default GPU 99% and the others ~95%. It only works if "cuda:0" for the device, not "cuda:1" (with cuda devices 0, 1, 2, 3 available; if it’s weird why I am asking for that: I have left the monitor plugged in into the cuda:0 device, and since it’s weekend, I am working from home :P).

Yeah, I had similar “use cases” in the past. :wink:

Could you try to set output_device=device in nn.DataParallel and set device='cuda:1'?

Thanks, I tried that, still the same issue. In fact, I tried all different combinations

device=torch.device('cuda:1'), device='cuda:1' (when using output_device=device, and transferring models and features)

as well

as output_device=1 Still the same issue.

By passing “cuda:0” it will only use the GPU 0 not all of them.

I think if "cuda:0" is your default/current device, the shouldn’t be a difference between torch.device("cuda") and torch.device("cuda:0") according to the docs at https://pytorch.org/docs/stable/tensor_attributes.html?highlight=torch%20device#torch-device

The parallelism (using all GPUs) is basically achieved via model = nn.DataParallel(model) where it uses all GPUs by default, except you specify only a subset via the device_ids argument of DataParallel).

What Gpus are you using ?

that would be 4 GeForce 1080Tis. They are all the same, but maybe there’s sth weird about the first device regarding the drivers like you said …

Hi again…

What I tried to say was …

device = ("cuda" if torch.cuda.is_available() else "cpu" )
model = net()
if torch.cuda.device_count() > 1:
    # device_ids has a default : all
    model = torch.nn.DataParallel(model, device_ids=[0, 1, 2, 3]) 
model.to(device)

Thanks, I tried that, but then it is still using the cuda:0 device for the data gathering, because on my setup, your example would be equivalent to

device = ("cuda:0" if torch.cuda.is_available() else "cpu" )
model = net()
if torch.cuda.device_count() > 1:
    # device_ids has a default : all
    model = torch.nn.DataParallel(model, device_ids=[0, 1, 2, 3]) 
model.to(device)

(PS: It’s correctly utilizing all 4 GPUs based on what I can see via nvidia-smi, but the gradient averaging etc. is carried out on the first card)

however, I want to use "cuda:1" (or 2, or 3) for the data gathering instead of the 1st device. It’s not super important to me, since I can simply plug in the monitor into a different card, but the discussion is also to see whether there’s a bug in PyTorch.

Then you can try this…
Letting device = “cuda” is generic in case of the pure model setting the first one GPU or in case of DataParallel you need send the exact ids you want or just omit it to get them all

device = ("cuda" if torch.cuda.is_available() else "cpu" )
model = net()
if torch.cuda.device_count() > 1:
    # device_ids has a default : all
    model = torch.nn.DataParallel(model, device_ids=[1,2]) 
model.to(device)

I do have another question for you.

What version are you using? I am using the recent commit of PyTorch what about you ?

Hm, but I still want to use all available GPUs. E.g., to better illustrate what I mean, let me use this DataParallel illustration from https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255:

Basically, in the ones that I circled, I want this to be e.g., GPU-2 not GPU-1. Theoretically, this should be possible by just setting device=torch.device('cuda:1')

and then using

  1. .to(device) for the model and the data tensors.
  2. As well as the output_device=device inside DataParallel

PS: I am using PyTorch 0.4.1, that’s the most recent one I think (except for the release candidates)

Now I get it

Yeah I know this article from this blog.

I believe the problem may be a bug on your version of pytorch that is too old.
I am using 1.0.0a0+479517
Just to mention before 1.0 was a 0.5 preview…

Good that you bring up the version issue. I will upgrade at a later point once the stable release comes out and then revisit that topic.

1 Like

If you have time you can build it or try to use the nightly build to try it. I believe is on the official web site. If not you still can build it. It will take a few hours depending of memory and cores on CPU … but you can try.

Yeah. But I probably will wait a bit since I currently have a well working custom compiled version of the stable one. It’s kind of a bit of work for me so I will wait until the next release comes out. This feature is not so urgent for me personally, but I suspected that there might be a bug in the PyTorch implementation when using DataParallel, because according to the docs, changing the default device (like the ones circled in the figure) should be possible.

1 Like

What if you say device_ids=[1, 2, 3,0] in nn.DataParallel? Or export CUDA_VISIBLE_DEVICES='1,2,3,0' in command prompt?

2 Likes

What if you say device_ids=[1, 2, 3,0] in nn.DataParallel?

Ha! This solves the issue! So when I define

device = torch.device("cuda:1")

for

model.to(device)

(and the data tensors) it breaks when I leave the defaults in DataParallel or have sth like device_ids=[0, 1, 2, 3] in DataParallel but if I then change it to device_ids=[1, 2, 3, 0] it works. Just wondering if this is intended or a bug. @smth

3 Likes

I will let the expert (@smth) to answer you. :slight_smile: IMHO it is intended to be this way.

2 Likes

makes sense then though. Just feel like it should be in the docs then. Happy to send a PR with an “explanation” if that’s indeed just a misunderstanding :slight_smile:

2 Likes