idohi
(Ido)
March 17, 2020, 10:08am
1
While training a BERT model on an AWS cluster with 16 GPUs (p2.16xlarge), I get a runtime error:
cuda runtime error (60) : peer mapping resources exhausted at /opt/conda/conda-bld/pytorch_1579040055865/work/aten/src/THC/THCGeneral.cpp:141
If I use a cluster with 8 GPUs (p2.8xlarge), it works fine.
Each GPU can only have up to 8 PCIe peers, so peer-to-peer (P2P) mappings cannot be enabled across all 16 GPUs over PCIe.
Note that this limitation is specific to PCIe; NVLink should work.
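To see which device pairs CUDA will allow P2P access between on a given machine, here is a minimal sketch using torch.cuda.can_device_access_peer (the helper name p2p_matrix is my own, not from the thread):

```python
import torch

def p2p_matrix():
    """Return a dict mapping (src, dst) GPU index pairs to whether
    CUDA reports that peer access is possible between them."""
    n = torch.cuda.device_count()
    return {
        (i, j): torch.cuda.can_device_access_peer(i, j)
        for i in range(n)
        for j in range(n)
        if i != j
    }

if __name__ == "__main__":
    if torch.cuda.is_available():
        # On a p2.16xlarge you would expect some pairs to report False,
        # since a GPU cannot map more than 8 PCIe peers.
        for pair, ok in sorted(p2p_matrix().items()):
            print(pair, ok)
    else:
        print("No CUDA devices visible")
```

On a machine with no visible GPUs the function simply returns an empty dict, so it is safe to run anywhere.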
idohi
(Ido)
March 19, 2020, 12:47pm
3
I’m new to these issues; how can I make it work with NVLink?
Your server would have to use NVLink to connect the GPUs.
You can check the connectivity matrix via nvidia-smi topo -m.
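As a sketch, the topology check plus one possible workaround on PCIe-only machines (assumptions: NCCL is the distributed backend, and "train.py" stands in for your launch script):

```shell
# Inspect the GPU interconnect matrix; PIX/PXB/PHB entries indicate
# PCIe hops, NV# entries indicate NVLink connections.
command -v nvidia-smi >/dev/null && nvidia-smi topo -m || true

# Possible workaround when NVLink is unavailable: NCCL_P2P_DISABLE=1
# tells NCCL to stage transfers through host memory instead of using
# P2P mappings, avoiding the 8-peer PCIe limit at some bandwidth cost.
# NCCL_P2P_DISABLE=1 python train.py
```

Disabling P2P trades transfer bandwidth for compatibility, so it is a fallback rather than a substitute for an NVLink-connected instance.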