os.environ['CUDA_VISIBLE_DEVICES'] does not work well

The code is below.

import torch
from torch import nn
import torch.distributed as dist
import torch.multiprocessing as mp
import os


class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.attr1 = nn.Parameter(torch.tensor([1., 2., 3.]))      # parameter
        self.register_buffer('attr2', torch.tensor([4., 5., 6.]))  # buffer
        self.attr3 = torch.tensor([7., 8., 9.])                    # plain tensor attribute
    
    def forward(self, x, rank):
        hd = x * self.attr1
        self.attr2 = self.attr2 / (rank + 1)
        hd = hd * self.attr2
        self.attr3 = self.attr3.to(rank)
        self.attr3 = self.attr3 / (rank + 1)
        y = hd * self.attr3
        y = y.mean()

        return y


def run(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    # torch.cuda.set_device(rank)                   # this variant works
    os.environ['CUDA_VISIBLE_DEVICES'] = f'{rank}'  # this variant raises an error

    my_model = MyModel().to(rank)
    my_model = nn.parallel.DistributedDataParallel(my_model, device_ids=[rank], output_device=rank)
    optimizer = torch.optim.SGD(my_model.parameters(), lr=0.001, momentum=0.9)
    input = torch.tensor([1., 2., 3.]) * (rank + 1)

    optimizer.zero_grad()
    output = my_model(input, rank)
    output.backward()
    if rank == 0:
        print(my_model.module.attr1.grad)
    optimizer.step()

    if rank == 0:
        print(my_model.module.attr1)
        print(my_model.module.attr2)
        print(my_model.module.attr3)


if __name__ == '__main__':
    world_size = 2
    mp.spawn(run, args=(world_size, ), nprocs=2)

    print('Finished')

Originally, I wrote this code to observe how parameters and buffers are synchronized in multi-GPU training.
In the end, I found that torch.cuda.set_device(rank) works fine, but os.environ['CUDA_VISIBLE_DEVICES'] does not; the latter raises an error.
The error message is below.

I hope someone can tell me why.

You have to set it before launching the Python process.
This is NVIDIA's behaviour, not PyTorch's.
The visible devices are assigned to the process before Python starts, so it doesn't work once you are already inside the script.
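
A minimal sketch (my own illustration, not part of the original reply) that shows the effect from inside Python: once the CUDA runtime has been initialized in the process, a later assignment to os.environ['CUDA_VISIBLE_DEVICES'] is ignored.

import os
import torch

torch.cuda.init()                         # the CUDA runtime reads CUDA_VISIBLE_DEVICES here
print(torch.cuda.device_count())          # e.g. 2 on a two-GPU machine

os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # only mutates the Python-side environment dict
print(torch.cuda.device_count())          # still 2: the already-initialized runtime ignores it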

So where should os.environ['CUDA_VISIBLE_DEVICES'] be set?
Above import torch?

It shouldn't be set inside the Python script; it should be set as an environment variable in the console, for example:
CUDA_VISIBLE_DEVICES=0,1 python your_script.py

Note that you SHOULDN'T set it as a permanent environment variable in your bashrc, as that affects the whole system.

That way I can only set the GPU devices to be used by all processes, not by each individual process.
But torch.cuda.set_device() can set the GPU device for each process separately.

You can manage internally (via torch commands) which GPU to use at any time.
Most of the data-parallel functions allow you to set that, and you can set the devices manually anyway 🙂

Just mentioning that defining CUDA_VISIBLE_DEVICES inside Python won't work no matter what you do.

So os.environ['CUDA_VISIBLE_DEVICES'] and torch.cuda.set_device() do not conflict.
Use CUDA_VISIBLE_DEVICES=0,1 python your_script.py to set all available GPU devices for all processes. In each process, we can also use torch.cuda.set_device() to specify the GPU device for this process.
Is this the correct understanding? 🤔
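
In other words, something like this (just my sketch of the relevant part of run(), assuming the script is launched with CUDA_VISIBLE_DEVICES=0,1 python your_script.py):

def run(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)   # bind this process to one of the GPUs made visible at launch

    my_model = MyModel().to(rank)
    my_model = nn.parallel.DistributedDataParallel(my_model, device_ids=[rank], output_device=rank)
    # ... training loop as before ...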

Use CUDA_VISIBLE_DEVICES=0,1 python your_script.py to set all available GPU devices for all processes.
I’m not aware of the internals of torch.cuda.set_device.

Just to mention that when you pass device_ids, this is a list that enumerates the available GPUs from PyTorch's point of view.

For example, if you launch with CUDA_VISIBLE_DEVICES=5,7,9, there will be 3 GPUs, numbered 0 to 2 inside PyTorch,
so you can pass device_ids=[0, 1, 2].
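
To make that concrete (just a sketch, assuming the machine really has physical GPUs 5, 7 and 9):

# launched as: CUDA_VISIBLE_DEVICES=5,7,9 python your_script.py
import torch

print(torch.cuda.device_count())     # 3: physical GPUs 5, 7, 9 appear as cuda:0, cuda:1, cuda:2
x = torch.zeros(1, device='cuda:2')  # actually allocated on physical GPU 9
# so each process would pass device_ids=[0], [1] or [2] to DistributedDataParallel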

I got it. Thank you.