Runtime error when using driver with 16 GPUs

While running BERT training on an AWS cluster with 16 GPUs (p2.16xlarge), I get the following runtime error:

cuda runtime error (60) : peer mapping resources exhausted at /opt/conda/conda-bld/pytorch_1579040055865/work/aten/src/THC/THCGeneral.cpp:141

If I use a cluster with 8 GPUs (p2.8xlarge), it works fine.


Each GPU can have at most 8 PCIe peers, so you cannot enable peer-to-peer (p2p) mappings across 16 GPUs over PCIe.
Note that this limitation is specific to PCIe; NVLink should work.
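If it helps with debugging, here is a minimal sketch (my own, not something from your setup) that lists which device pairs report p2p capability via torch.cuda.can_device_access_peer. Keep in mind this only reports whether peer access is possible between a given pair; the "peer mapping resources exhausted" error comes from the separate limit of 8 simultaneously enabled peer mappings per GPU.

import torch

# Print, for each visible GPU, the set of devices it could establish
# peer-to-peer access with (capability check only, mappings are not enabled).
n = torch.cuda.device_count()
for i in range(n):
    peers = [j for j in range(n)
             if j != i and torch.cuda.can_device_access_peer(i, j)]
    print(f"GPU {i}: {len(peers)} potential peers -> {peers}")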

I’m new to these issues; how can I make it work with NVLink?

Your server would have to use NVLink to connect the GPUs.
You can check the connectivity matrix via nvidia-smi topo -m.
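For convenience, here is a small sketch (assuming nvidia-smi is on the PATH) that dumps that connectivity matrix from Python. In the matrix, "NV#" entries mean the pair is connected via NVLink, while PIX/PXB/PHB/SYS entries indicate PCIe or system-level paths.

import subprocess

# Run nvidia-smi topo -m and print the GPU connectivity matrix.
result = subprocess.run(["nvidia-smi", "topo", "-m"],
                        capture_output=True, text=True)
print(result.stdout)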