While running BERT training on an AWS instance with 16 GPUs (p2.16xlarge), I get a runtime error:
cuda runtime error (60) : peer mapping resources exhausted at /opt/conda/conda-bld/pytorch_1579040055865/work/aten/src/THC/THCGeneral.cpp:141
If I use an instance with 8 GPUs (p2.8xlarge), it works fine.
Each GPU can only have 8 PCIe peers, so you cannot use a peer-to-peer (P2P) mapping across 16 GPUs over PCIe.
Note that this limitation is specific to PCIe; NVLink should work.
I’m new to these issues; how can I make it work with NVLink?
Your server would have to use NVLink to connect the GPUs. The p2 instances connect their K80 GPUs over PCIe only; on AWS, the p3 instances use NVLink-connected V100 GPUs.
You can check the connectivity matrix via nvidia-smi topo -m.
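If you prefer to check P2P reachability from Python rather than from nvidia-smi, a minimal sketch using PyTorch's torch.cuda.can_device_access_peer is below. It only reports meaningful results on the machine with the GPUs; on a CUDA-less machine it simply prints an empty matrix.

```python
import torch

def peer_access_matrix():
    # Build an N x N matrix of P2P accessibility between visible GPUs.
    # On PCIe-only systems each GPU supports at most 8 peers, so with
    # 16 GPUs some off-diagonal entries will be False.
    n = torch.cuda.device_count()  # 0 if CUDA is unavailable
    return [[i != j and torch.cuda.can_device_access_peer(i, j)
             for j in range(n)] for i in range(n)]

if __name__ == "__main__":
    for row in peer_access_matrix():
        print(["P2P" if ok else "---" for ok in row])
```

On a p2.16xlarge you would expect rows where more than 8 entries are "---", matching the error above.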