CUDA error: peer mapping resources exhausted

Hi! I’ve been trying to use nn.DataParallel for my model but am encountering this error:

Traceback (most recent call last):
  File "train.py", line 382, in <module>
    main(args, hparams)
  File "train.py", line 373, in main
    train(model, train_dataset, val_dataset, retrieved_tokens, args.output_dir, hparams, device)
  File "train.py", line 314, in train
    train_loss, train_accuracy, train_per_token_accuracy, train_perplexity = train_epoch(model, train_loader, retrieved_tokens, optimizer, hparams, device)
  File "train.py", line 236, in train_epoch
    result = model(x_tokens, retrieved, mask_tokens)
  File "/home/yeh803/.conda/envs/plm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1148, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/home/yeh803/.conda/envs/plm/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 158, in forward
    inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
  File "/home/yeh803/.conda/envs/plm/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 175, in scatter
    return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
  File "/home/yeh803/.conda/envs/plm/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 44, in scatter_kwargs
    inputs = scatter(inputs, target_gpus, dim) if inputs else []
  File "/home/yeh803/.conda/envs/plm/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 36, in scatter
    res = scatter_map(inputs)
  File "/home/yeh803/.conda/envs/plm/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 23, in scatter_map
    return list(zip(*map(scatter_map, obj)))
  File "/home/yeh803/.conda/envs/plm/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 19, in scatter_map
    return Scatter.apply(target_gpus, None, dim, obj)
  File "/home/yeh803/.conda/envs/plm/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 96, in forward
    outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
  File "/home/yeh803/.conda/envs/plm/lib/python3.8/site-packages/torch/nn/parallel/comm.py", line 189, in scatter
    return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
RuntimeError: CUDA error: peer mapping resources exhausted

I tried using both 8 GPUs and 4 GPUs, and the problem persists. I also tried exporting NCCL_P2P_DISABLE=1, but that didn't help either. I have 80 GB of free RAM, so CPU memory shouldn't be the problem. Any help would be much appreciated!
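
In case it helps, here is a stripped-down sketch of how I wrap the model. ToyModel and the tensor shapes are just placeholders, only the argument names (x_tokens, retrieved, mask_tokens) match my actual code:

```python
import torch
import torch.nn as nn

# Toy stand-in for the real model; the real one also takes (x_tokens, retrieved, mask_tokens).
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(16, 16)

    def forward(self, x_tokens, retrieved, mask_tokens):
        return self.proj(x_tokens) + self.proj(retrieved) * mask_tokens

device = torch.device("cuda:0")
model = nn.DataParallel(ToyModel()).to(device)   # replicates across all visible GPUs

x_tokens = torch.randn(8, 16, device=device)
retrieved = torch.randn(8, 16, device=device)
mask_tokens = torch.ones(8, 16, device=device)

# The scatter in the traceback happens inside this call, when DataParallel
# splits each input tensor across the visible devices.
result = model(x_tokens, retrieved, mask_tokens)
```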

The error message indicates that you might be trying to enable more than 8 peer connections per device.
Are you masking the 8 GPUs via CUDA_VISIBLE_DEVICES? If not, could any device be trying to enable more than 8 peer connections?
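
As a quick diagnostic, you could print how many peers each visible device can reach. This only reports peer-access capability via torch.cuda.can_device_access_peer, not whether peer access is actually enabled, but it shows how many connections could be requested:

```python
import torch

# For each visible device, list the other devices it could enable peer access to.
# If this exceeds 8 for any device, you can hit "peer mapping resources exhausted".
n = torch.cuda.device_count()
for i in range(n):
    peers = [j for j in range(n) if j != i and torch.cuda.can_device_access_peer(i, j)]
    print(f"device {i}: {len(peers)} possible peers -> {peers}")
```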

There are 10 GPUs on my compute node, and I've tried setting os.environ['CUDA_VISIBLE_DEVICES']='0,1' at the top of my Python code to limit the GPUs the model can see to fewer than 8. Should I set this in the system environment instead of in the Python code?

Thank you, it seems to work when I prepend CUDA_VISIBLE_DEVICES=0,1 to the command that runs the script! Interestingly, when I set it to 0,1,2,3,4,5,6,7,8,9, it still worked, albeit only using GPUs 0–7.

Do you have any idea why I need to set this on the command line instead of in the code, though? Also, would DistributedDataParallel have the same problem? I am thinking of replacing DP with DDP.

You might be setting the env variable too late in your actual script. Once the CUDA context has been created, this env variable no longer has any effect, which is why I usually recommend exporting it in your current terminal or prepending it to the python command.
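
As a rough sketch of the two options (the exact point at which the CUDA context is created depends on your code, so the shell variant is the most robust):

```python
# Option 1 (recommended): set it in the shell, before Python starts, e.g.
#   CUDA_VISIBLE_DEVICES=0,1 python train.py
# or
#   export CUDA_VISIBLE_DEVICES=0,1

# Option 2: set it in Python, but it must happen before anything initializes CUDA,
# i.e. before the first call that creates a CUDA context (and in practice it is
# safest to set it before importing torch at all):
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch
print(torch.cuda.device_count())  # should now report 2
```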

That is a good idea, as DP suffers from the overhead of replicating the model onto each device in every forward pass, as well as from imbalanced GPU memory usage. DDP should thus give you better performance.
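
A minimal DDP training script could look roughly like this; the model, dataset, and hyperparameters are placeholders, and it assumes a launch via torchrun:

```python
# Launch with e.g.:
#   torchrun --nproc_per_node=2 train_ddp.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    model = nn.Linear(16, 16).cuda(local_rank)   # placeholder model
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(256, 16), torch.randn(256, 16))
    sampler = DistributedSampler(dataset)        # splits the data across ranks
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                 # reshuffle per epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()                      # gradients are all-reduced here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Each process drives a single GPU, so the per-forward-pass replication of DP goes away and the data split is handled by the DistributedSampler instead of a scatter.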


Thank you so much! Very helpful.