CUDA initialization time on a multi-GPU machine for inference

I am trying to parallelize CUDA initialization across multiple GPUs in order to make my PyTorch model's startup faster: I have noticed that this initialization takes about 3 s per GPU on a P6000.
Simple code like the following

pool = mp.get_context('spawn').Pool(torch.cuda.device_count())
pool.map(lambda gpu: models[gpu].cuda(gpu), range(torch.cuda.device_count()))

does not seem to show any performance improvement compared to

for gpu in range(torch.cuda.device_count()):
    models[gpu].cuda(gpu)

It looks like these operations cannot be parallelized. Am I doing something wrong? Is there any way to make this faster?
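For what it's worth, here is a minimal, CUDA-free sketch of the pooled-initialization pattern I am attempting. `init_gpu` is a hypothetical stand-in for the real per-device work (e.g. `models[gpu].cuda(gpu)`), with a `sleep` in place of the ~3 s context creation; note that a `'spawn'` pool needs a module-level function rather than a lambda, since the worker target must be picklable:

```python
import time
import multiprocessing as mp

def init_gpu(gpu: int) -> float:
    # Stand-in for the real per-device initialization, e.g.:
    #   torch.cuda.set_device(gpu); models[gpu].cuda(gpu)
    start = time.time()
    time.sleep(1.0)  # placeholder for the ~3 s CUDA context creation
    return time.time() - start

if __name__ == '__main__':
    n_gpus = 4  # stand-in for torch.cuda.device_count()
    start = time.time()
    # One worker per device; 'spawn' mirrors what CUDA requires, since
    # CUDA cannot be (re)initialized safely in forked child processes.
    with mp.get_context('spawn').Pool(n_gpus) as pool:
        durations = pool.map(init_gpu, range(n_gpus))
    print(f'pooled init took {time.time() - start:.1f}s '
          f'for {len(durations)} workers')
```

With a plain `sleep` the pooled version does finish in roughly one unit of work plus process-spawn overhead, rather than `n_gpus` units; the question is why the same pattern does not help when the per-worker work is real CUDA context creation.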