Making pynvml match torch device ids (CUDA_VISIBLE_DEVICES)

stas · November 16, 2020, 9:19pm

Yesterday I discovered that pynvml doesn't respect CUDA_VISIBLE_DEVICES, so the device ids used by torch.cuda and nvml no longer match once you use CUDA_VISIBLE_DEVICES to change which GPUs are visible. This breaks software that relies on pynvml.

I wrote a re-mapper to solve the problem:

https://github.com/stas00/ipyexperiments/blob/a60a23141e2ecb11c0554f8be535fe6ccb831604/ipyexperiments/utils/mem.py#L33-L44

import os

def get_nvml_gpu_id(torch_gpu_id):
    """
    Remap a torch device id to an nvml device id, respecting CUDA_VISIBLE_DEVICES.

    If the variable isn't set, return the same id.
    """
    # pynvml ignores CUDA_VISIBLE_DEVICES, so remap the id ourselves when it is set
    if "CUDA_VISIBLE_DEVICES" in os.environ:
        # e.g. "2,3" -> [2, 3]; assumes numeric ids (not GPU UUIDs)
        ids = [int(x) for x in os.environ["CUDA_VISIBLE_DEVICES"].split(",")]
        return ids[torch_gpu_id]  # remap
    return torch_gpu_id

Maybe it will be of use to others.

Usage:

    [...]
    torch_gpu_id = torch.cuda.current_device()
    nvml_gpu_id = get_nvml_gpu_id(torch_gpu_id)
    handle = pynvml.nvmlDeviceGetHandleByIndex(nvml_gpu_id)
    [....]

I also suggested that pynvml integrate this remapping directly: https://github.com/gpuopenanalytics/pynvml/issues/28
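
For anyone who wants to check the remapping without a GPU, here is a minimal sketch of a variant that takes the CUDA_VISIBLE_DEVICES value as an argument. The name remap_visible_device and its visible_devices parameter are hypothetical (not part of ipyexperiments or pynvml). It assumes the variable holds numeric ids only, and that CUDA and nvml enumerate devices in the same order, which as far as I know is only guaranteed when CUDA_DEVICE_ORDER=PCI_BUS_ID is set.

```python
import os


def remap_visible_device(torch_gpu_id, visible_devices=None):
    """Hypothetical variant of get_nvml_gpu_id that can be tested without a GPU.

    visible_devices: the CUDA_VISIBLE_DEVICES string to use; if None, it is
    read from the environment.
    """
    if visible_devices is None:
        visible_devices = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible_devices is None:
        # Variable not set: torch and nvml ids are taken to be the same.
        return torch_gpu_id

    # Split on commas, dropping surrounding whitespace and empty entries.
    entries = [e.strip() for e in visible_devices.split(",") if e.strip()]
    if not 0 <= torch_gpu_id < len(entries):
        raise IndexError(
            f"torch device {torch_gpu_id} is not in CUDA_VISIBLE_DEVICES={visible_devices!r}"
        )

    entry = entries[torch_gpu_id]
    if not entry.isdigit():
        # Assumption: non-numeric entries (e.g. GPU UUIDs) are out of scope here.
        raise ValueError(f"non-numeric entry {entry!r} is not handled by this sketch")
    return int(entry)


# With CUDA_VISIBLE_DEVICES="2,3", torch device 0 is nvml device 2.
print(remap_visible_device(0, "2,3"))   # 2
print(remap_visible_device(1, "2, 3"))  # 3
```

Passing the string explicitly keeps the function easy to unit test; in real use you would call it with only the torch device id and let it read the environment.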