I am trying to parallelize CUDA initialization across multiple GPUs. My goal is to make my PyTorch model's initialization faster: I have noticed that this initialization takes about 3 s per GPU on a P6000.
A simple snippet like the following one
```python
import torch
import torch.multiprocessing as mp

pool = mp.get_context('spawn').Pool(torch.cuda.device_count())
pool.map(torch.Tensor().cuda, range(torch.cuda.device_count()))
```
does not seem to show any performance improvement compared to
```python
for gpu in range(torch.cuda.device_count()):
    torch.Tensor().cuda(device=gpu)
```
It looks like these operations cannot be parallelized. Am I doing something wrong? Is there any way to make this faster?
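For reference, here is a thread-based sketch of the same idea that I would also expect to be worth comparing. It assumes the 3 s cost is per-device context creation (rather than one-time driver setup) and that threads, which share the parent process's CUDA state, can overlap that work; I am not certain either assumption holds.

```python
# Sketch: initialize each GPU from its own thread instead of a spawned process.
# Assumption: per-device CUDA context creation can overlap across threads.
from concurrent.futures import ThreadPoolExecutor

import torch

def init_device(gpu: int) -> int:
    # Allocating a tensor on the device forces CUDA context creation there.
    torch.zeros(1, device=f"cuda:{gpu}")
    return gpu

n = torch.cuda.device_count()
with ThreadPoolExecutor(max_workers=max(n, 1)) as pool:
    done = list(pool.map(init_device, range(n)))
```

On a machine with no visible GPUs this simply does nothing, since `range(n)` is empty.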