os.environ['CUDA_VISIBLE_DEVICES'] does not work well

The code is below.

import torch
from torch import nn
import torch.distributed as dist
import torch.multiprocessing as mp
import os


class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.attr1 = nn.Parameter(torch.tensor([1., 2., 3.]))      # parameter
        self.register_buffer('attr2', torch.tensor([4., 5., 6.]))  # buffer
        self.attr3 = torch.tensor([7., 8., 9.])                    # plain tensor attribute
    
    def forward(self, x, rank):
        hd = x * self.attr1
        self.attr2 = self.attr2 / (rank + 1)
        hd = hd * self.attr2
        self.attr3 = self.attr3.to(rank)
        self.attr3 = self.attr3 / (rank + 1)
        y = hd * self.attr3
        y = y.mean()

        return y


def run(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    # torch.cuda.set_device(rank)                   # this variant works
    os.environ['CUDA_VISIBLE_DEVICES'] = f'{rank}'  # this variant raises an error

    my_model = MyModel().to(rank)
    my_model = nn.parallel.DistributedDataParallel(my_model, device_ids=[rank], output_device=rank)
    optimizer = torch.optim.SGD(my_model.parameters(), lr=0.001, momentum=0.9)
    input = torch.tensor([1., 2., 3.]) * (rank + 1)

    optimizer.zero_grad()
    output = my_model(input, rank)
    output.backward()
    if rank == 0:
        print(my_model.module.attr1.grad)
    optimizer.step()

    if rank == 0:
        print(my_model.module.attr1)
        print(my_model.module.attr2)
        print(my_model.module.attr3)


if __name__ == '__main__':
    world_size = 2
    mp.spawn(run, args=(world_size, ), nprocs=2)

    print('Finished')

Originally, I wrote this code to observe how parameters and buffers are synchronized in multi-GPU training.
In the end, I found that torch.cuda.set_device(rank) works fine, but os.environ['CUDA_VISIBLE_DEVICES'] does not; the latter raises an error.
The error message is below.

I hope someone can tell me why.

You have to set it before launching the Python process.
This is NVIDIA's behaviour, not PyTorch's.
The visible devices are assigned to the process before Python starts, so it doesn't work once you are already inside the script.
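
A minimal sketch (my own illustration, not part of the original reply) that shows the effect from inside Python: once the CUDA runtime has been initialized in the process, a later assignment to os.environ['CUDA_VISIBLE_DEVICES'] is ignored.

import os
import torch

torch.cuda.init()                         # the CUDA runtime reads CUDA_VISIBLE_DEVICES here
print(torch.cuda.device_count())          # e.g. 2 on a two-GPU machine

os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # only mutates the Python-side environment dict
print(torch.cuda.device_count())          # still 2: the already-initialized runtime ignores it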

So where should os.environ['CUDA_VISIBLE_DEVICES'] be set?
Above import torch?

It shouldn't be set inside the Python script; it should be set as an environment variable in the console, for example:
CUDA_VISIBLE_DEVICES=0,1 python your_script.py

Note that you SHOULDN'T set it as a permanent environment variable in your bashrc, as that affects the whole system.

That way I can only set the GPU devices to be used by all processes, not by each individual process.
But torch.cuda.set_device() can set the GPU device for each process separately.

You can manage internally (via torch commands) which GPU to use at any time.
Most of the data-parallel functions allow you to set that, and you can set the devices manually anyway 🙂

Just mentioning that defining CUDA_VISIBLE_DEVICES inside Python won't work no matter what you do.

So os.environ['CUDA_VISIBLE_DEVICES'] and torch.cuda.set_device() do not conflict.
Use CUDA_VISIBLE_DEVICES=0,1 python your_script.py to set all available GPU devices for all processes. In each process, we can also use torch.cuda.set_device() to specify the GPU device for this process.
Is this the correct understanding? 🤔
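
In other words, something like this (just my sketch of the relevant part of run(), assuming the script is launched with CUDA_VISIBLE_DEVICES=0,1 python your_script.py):

def run(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)   # bind this process to one of the GPUs made visible at launch

    my_model = MyModel().to(rank)
    my_model = nn.parallel.DistributedDataParallel(my_model, device_ids=[rank], output_device=rank)
    # ... training loop as before ...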

Use CUDA_VISIBLE_DEVICES=0,1 python your_script.py to set all available GPU devices for all processes.
I’m not aware of the internals of torch.cuda.set_device.

Just to mention that when you pass device_ids, this is a list that enumerates the available GPUs from PyTorch's point of view.

For example, if you launch with CUDA_VISIBLE_DEVICES=5,7,9, there will be 3 GPUs, numbered 0 to 2 inside PyTorch,
so you can pass device_ids=[0, 1, 2].
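
To make that concrete (just a sketch, assuming the machine really has physical GPUs 5, 7 and 9):

# launched as: CUDA_VISIBLE_DEVICES=5,7,9 python your_script.py
import torch

print(torch.cuda.device_count())     # 3: physical GPUs 5, 7, 9 appear as cuda:0, cuda:1, cuda:2
x = torch.zeros(1, device='cuda:2')  # actually allocated on physical GPU 9
# so each process would pass device_ids=[0], [1] or [2] to DistributedDataParallel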

I got it. Thank you.