RuntimeError: cuda runtime error (711) with nn.DataParallel

When using nn.DataParallel(model) I get a cuda runtime error (711). I tried to search for what error 711 means but couldn't figure it out. I assume this is not PyTorch-related, but I'm posting in case anyone knows why it happens.

Without nn.DataParallel(model) the model trains without any issue.
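For reference, the wrapping looks roughly like this (a minimal, runnable sketch with a dummy model and random inputs standing in for the real training code, which is not shown here):

import torch
import torch.nn as nn

# Dummy stand-in for the real model; the actual architecture is not part of this post.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 2),
)
model = nn.DataParallel(model).cuda()  # replicate the model across all visible GPUs

images = torch.randn(6, 3, 32, 32).cuda()
prediction = model(images)  # inputs are scattered to the GPUs here, which is where the error is raised

The traceback: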

    prediction = self._model(images)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 151, in forward
    inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in scatter
    return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 36, in scatter_kwargs
    inputs = scatter(inputs, target_gpus, dim) if inputs else []
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 28, in scatter
    res = scatter_map(inputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 15, in scatter_map
    return list(zip(*map(scatter_map, obj)))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 13, in scatter_map
    return Scatter.apply(target_gpus, None, dim, obj)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 89, in forward
    outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/comm.py", line 147, in scatter
    return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
RuntimeError: cuda runtime error (711) : peer mapping resources exhausted at /opt/conda/conda-bld/pytorch_1587428398394/work/aten/src/THC/THCGeneral.cpp:136

How many GPUs are you using?
This error might be raised if you are using too many GPUs for a peer-to-peer connection (max. 8 peers per GPU).
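If it helps, you could also query which device pairs report peer access directly from PyTorch (an illustrative check, not part of the original reply; torch.cuda.can_device_access_peer is available in recent releases):

import torch

# Print whether each ordered pair of visible GPUs reports peer (P2P) access.
num_gpus = torch.cuda.device_count()
for i in range(num_gpus):
    for j in range(num_gpus):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: {'can' if ok else 'cannot'} access peer")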

No, there are only 3 GPUs on the machine.

I have the same issue with 2 GPUs. Did you find a solution? Thanks.

Could both of you check the connectivity via nvidia-smi topo -m and also run p2pBandwidthLatencyTest?

Thank you for your response. The outputs are below.

Output for nvidia-smi topo -m

	GPU0	GPU1	GPU2	CPU Affinity
GPU0	 X 	PHB	SYS	0-13,28-41
GPU1	PHB	 X 	SYS	0-13,28-41
GPU2	SYS	SYS	 X 	14-27,42-55

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

Output for p2pBandwidthLatencyTest

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, GeForce GTX 1080 Ti, pciBusID: 3, pciDeviceID: 0, pciDomainID:0
Device: 1, GeForce GTX 1080 Ti, pciBusID: 4, pciDeviceID: 0, pciDomainID:0
Device: 2, GeForce GTX 1080 Ti, pciBusID: 81, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CANNOT Access Peer Device=2
Device=1 CAN Access Peer Device=0
Device=1 CANNOT Access Peer Device=2
Device=2 CANNOT Access Peer Device=0
Device=2 CANNOT Access Peer Device=1

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2
     0	     1     1     0
     1	     1     1     0
     2	     0     0     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2
     0 377.42  11.55  11.45
     1  11.50 354.79  11.45
     2  11.38  11.41 355.11
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
Cuda failure p2pBandwidthLatencyTest.cu:185: 'peer mapping resources exhausted'

Thanks for the update. Could you create an issue here and post the information from your last post there, please?


For future reference, I have opened an issue related to this problem at NVIDIA/nccl.
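In the meantime, a possible (untested) workaround given the topology above would be to restrict DataParallel to the pair of GPUs that actually report peer access (GPU 0 and 1) and leave GPU 2 out, either via device_ids or by setting CUDA_VISIBLE_DEVICES=0,1 before launching. Whether this avoids error 711 depends on what is exhausting the peer-mapping resources, so treat it as a sketch rather than a confirmed fix:

import torch
import torch.nn as nn

# Hypothetical model stand-in; only GPUs 0 and 1 (the PHB-connected pair) are used.
model = nn.Sequential(nn.Linear(16, 4))
model = nn.DataParallel(model, device_ids=[0, 1]).cuda()  # .cuda() puts the parameters on device_ids[0]

x = torch.randn(8, 16).cuda()
out = model(x)  # replicas run only on GPUs 0 and 1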