I am trying to parallelize CUDA initialization across multiple GPUs. My goal is to make my PyTorch model's initialization faster: I have noticed that this initialization takes about 3 s per GPU on a P6000.
A simple snippet like the following one
```python
import torch
import torch.multiprocessing as mp

pool = mp.get_context('spawn').Pool(torch.cuda.device_count())
pool.map(torch.Tensor().cuda, range(torch.cuda.device_count()))
```
does not seem to show any performance improvement compared to
```python
for gpu in range(torch.cuda.device_count()):
    torch.Tensor().cuda(device=gpu)
```
It looks like these operations cannot be parallelized. Am I doing something wrong? Is there any way to make this faster?
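For reference, here is a thread-based sketch of the same idea that I would also expect to be worth comparing. It assumes the 3 s cost is per-device context creation (rather than one-time driver setup) and that threads, which share the parent process's CUDA state, can overlap that work; I am not certain either assumption holds.

```python
# Sketch: initialize each GPU from its own thread instead of a spawned process.
# Assumption: per-device CUDA context creation can overlap across threads.
from concurrent.futures import ThreadPoolExecutor

import torch

def init_device(gpu: int) -> int:
    # Allocating a tensor on the device forces CUDA context creation there.
    torch.zeros(1, device=f"cuda:{gpu}")
    return gpu

n = torch.cuda.device_count()
with ThreadPoolExecutor(max_workers=max(n, 1)) as pool:
    done = list(pool.map(init_device, range(n)))
```

On a machine with no visible GPUs this simply does nothing, since `range(n)` is empty.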