torch.cuda.device_count() does not return the correct number of GPUs

I am running a script on a Slurm cluster with 6 nodes, each with 2 A40 GPUs.

import os
import platform

import torch

# Values set by the launcher / Slurm for this task.
world_size = int(os.environ["WORLD_SIZE"])
rank = int(os.environ["SLURM_PROCID"])
gpus_per_node = int(os.environ["SLURM_GPUS_ON_NODE"])
print(f"Hello from rank {rank} of {world_size} on {platform.node()} where there are {gpus_per_node} allocated GPUs per node.", flush=True)
print(gpus_per_node, torch.cuda.device_count(), flush=True)
assert gpus_per_node == torch.cuda.device_count()

I sometimes (not always) get an `AssertionError`, and it seems to happen when `torch.cuda.device_count()` returns 1.

I tried adding `print(os.environ['CUDA_VISIBLE_DEVICES'], flush=True)`, and it prints `0,1` even when `torch.cuda.device_count()` returns 1.
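To narrow this down, it may help to log everything relevant per task in one place. Below is a minimal sketch, not a definitive diagnostic: `count_visible_devices` and `gather_gpu_diagnostics` are hypothetical helper names, and the entry count is only a rough comma-split of `CUDA_VISIBLE_DEVICES` (it does not validate the indices or UUIDs). `SLURM_LOCALID` is included on the assumption that the job is launched with `srun`.

```python
import os
import platform


def count_visible_devices(value):
    """Roughly count entries in a CUDA_VISIBLE_DEVICES-style string.

    Returns None when the variable is unset. This only splits on commas;
    it does not check that each entry refers to a real device.
    """
    if value is None:
        return None
    return len([entry for entry in value.split(",") if entry.strip()])


def gather_gpu_diagnostics(env=None):
    """Collect host, Slurm, and CUDA visibility info for the current task."""
    env = os.environ if env is None else env
    info = {
        "host": platform.node(),
        "pid": os.getpid(),
        "SLURM_PROCID": env.get("SLURM_PROCID"),
        "SLURM_LOCALID": env.get("SLURM_LOCALID"),
        "SLURM_GPUS_ON_NODE": env.get("SLURM_GPUS_ON_NODE"),
        "CUDA_VISIBLE_DEVICES": env.get("CUDA_VISIBLE_DEVICES"),
    }
    info["visible_entries"] = count_visible_devices(info["CUDA_VISIBLE_DEVICES"])
    try:
        # Imported lazily so the helper still runs where torch is missing.
        import torch
        info["torch_device_count"] = torch.cuda.device_count()
    except ImportError:
        info["torch_device_count"] = None
    return info


if __name__ == "__main__":
    print(gather_gpu_diagnostics(), flush=True)
```

Printing this dictionary from every rank, right before the assert, should show whether the failures cluster on a particular node or local task.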

After browsing around, I still cannot figure out why this happens. I assume it is not caused by MIG, since MIG is not supported on the A40?

I don't know how or why MIG would be related, but you can check whether it's enabled via nvidia-smi and disable it if needed.
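One way to run that check from inside the job, as a rough sketch: it queries the `mig.mode.current` field through `nvidia-smi --query-gpu` and parses the CSV output. The helper names `parse_mig_rows` and `query_mig_modes` are made up for illustration, and the parser assumes GPU names contain no commas.

```python
import shutil
import subprocess

# nvidia-smi query for the index, name, and current MIG mode of each GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,mig.mode.current",
    "--format=csv,noheader",
]


def parse_mig_rows(text):
    """Parse 'index, name, mig_mode' CSV lines into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) == 3:
            rows.append({"index": parts[0], "name": parts[1], "mig_mode": parts[2]})
    return rows


def query_mig_modes():
    """Return per-GPU MIG info, or an empty list if nvidia-smi is unavailable or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return []
    return parse_mig_rows(result.stdout)


if __name__ == "__main__":
    for row in query_mig_modes():
        print(row, flush=True)
```

Running this on the node where the assert fails also shows how many GPUs the driver itself reports there, which can be compared against `torch.cuda.device_count()`.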
