Multiple GPUs: setting CUDA_VISIBLE_DEVICES via os.environ does not work

Hi, I’ve tried to set CUDA_VISIBLE_DEVICES = '1' in my main function, but when I move the model to CUDA, it does not go to GPU1 but to GPU0 instead (resulting in OOM, since GPU0 is already in use). Please tell me if I’m doing something wrong.
Here is my code, in train.py:

import os  # needed for os.environ below; project-specific imports omitted

def main(config_file_path):
    config = SspmYamlConfig(config_file_path)

    dataloader_cfg = config.get_dataloader_cfg()
    trainer_cfg = config.get_trainer_cfg()
    logger_cfg = config.get_logger_cfg()
    model_cfg = config.get_model_cfg()
    pose_dataset_cfg = config.get_pose_dataset_cfg()
    data_augmentation_cfg = config.get_augmentation_cfg()
    target_generator_cfg = config.get_target_generator_cfg()

    learning_rate = trainer_cfg['optimizer']['learning_rate']
    # device list parsed from the config, e.g. [1] -> '1'
    device = ','.join(map(str, trainer_cfg['device']))
    os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
    os.environ['CUDA_VISIBLE_DEVICES'] = device
    model = getModel(model_cfg)
    train_loader = DataLoader(train_dataset, **dataloader_cfg['train'])
    val_loader = DataLoader(val_dataset, **dataloader_cfg['val'])
    trainer = Trainer(
                      model, optimizer, logger,
                      writer, config, train_loader, val_loader
                      )

The Trainer class inherits from BaseTrainer, where the model is transferred to CUDA:

from abc import ABC

import pynvml
import torch

class BaseTrainer(ABC):
    def __init__(self, model, optimizer, logger, writer, config):
        self.config = config
        self.logger = logger
        self.writer = writer
        self.optimizer = optimizer
        self.trainer_config = config.get_trainer_cfg()
        self.device_list = self.trainer_config['device']  # device list from the config, e.g. [1]
        self.device_type = self._check_gpu(self.device_list)
        self.device = torch.device(self.device_type)
        self.model = model
        self.model = self.model.to(self.device)
        self.model = torch.nn.DataParallel(self.model)

    def _check_gpu(self, gpus):
        if len(gpus) > 0 and torch.cuda.is_available():
            pynvml.nvmlInit()
            for i in gpus:
                handle = pynvml.nvmlDeviceGetHandleByIndex(i)
                meminfo = pynvml.nvmlDeviceGetMemoryInfo(handle)
                memused = meminfo.used / 1024 / 1024
                self.logger.info('GPU{} used: {}M'.format(i, memused))
                if memused > 1000:
                    pynvml.nvmlShutdown()
                    raise ValueError('GPU{} is occupied!'.format(i))
            pynvml.nvmlShutdown()
            return 'cuda'
        else:
            self.logger.info('Using CPU!')
            return 'cpu'

If you are masking devices via CUDA_VISIBLE_DEVICES, all visible devices will be mapped to device ids in the range [0, nb_visible_devices - 1].
E.g. if your system has two GPUs and you are using CUDA_VISIBLE_DEVICES=1, you would have to access it inside the script as cuda:0.
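For example, a minimal sketch of the remapping (assuming a machine with at least two GPUs):

import os

# Mask everything except physical GPU1; must happen before CUDA is initialized.
os.environ['CUDA_VISIBLE_DEVICES'] = '1'

import torch

print(torch.cuda.device_count())   # 1 -- only the masked-in GPU is visible
x = torch.randn(2, 2).to('cuda:0') # cuda:0 now refers to physical GPU1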

Thank you for your quick reply, but I have a question: I have 3 GPUs and want to use only GPU1 and GPU2 (GPU0 is in use). How should I do that?

If all devices are the same, use
CUDA_VISIBLE_DEVICES=1,2 python script.py args
to run the script, and inside the script use cuda:0 and cuda:1 (or the equivalent .cuda(0) and .cuda(1) calls).
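Inside the script that might look like this (a minimal sketch using DataParallel, as in the code above):

import torch

# With CUDA_VISIBLE_DEVICES=1,2, physical GPU1 and GPU2 appear as cuda:0 and cuda:1.
model = torch.nn.Linear(10, 10).to('cuda:0')             # lands on physical GPU1
model = torch.nn.DataParallel(model, device_ids=[0, 1])  # replicates across both visible GPUs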

However, if the mapping is not what you expect via nvidia-smi, you could force the PCI bus order via CUDA_DEVICE_ORDER=PCI_BUS_ID in front of the aforementioned command.
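i.e.:
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=1,2 python script.py args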

I’ve found that I need to set CUDA_VISIBLE_DEVICES at the beginning of my script, before torch initializes CUDA. It was my mistake, thank you for your help. I will close this topic.
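In other words, something like this at the very top of train.py (a minimal sketch, assuming GPUs 1 and 2 should be used):

import os

# Set the mask before torch is imported, so CUDA initializes with the restricted device list.
os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
os.environ['CUDA_VISIBLE_DEVICES'] = '1,2'

import torch

print(torch.cuda.device_count())  # 2 -- physical GPU1 and GPU2, visible as cuda:0 and cuda:1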

Hi,
I have 8 GPUs, but suddenly the VM couldn’t recognize them and raises:
RuntimeError: No GPUs available.

When I export CUDA_VISIBLE_DEVICES=1, it finds GPU1 and runs the code on it. However, I need to use all 8 GPUs, so I set export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7, but I get the same "No GPUs available" error again.
How can I fix it?

I don’t know what might be causing your issue, but you might want to check if your script is explicitly setting CUDA_VISIBLE_DEVICES somewhere.
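As a first check, printing what the process actually sees can help narrow it down (a minimal diagnostic sketch):

import os
import torch

# Shows whether something has already restricted the visible devices
print('CUDA_VISIBLE_DEVICES =', os.environ.get('CUDA_VISIBLE_DEVICES'))
print('torch.cuda.is_available():', torch.cuda.is_available())
print('torch.cuda.device_count():', torch.cuda.device_count())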