CUDA initialization time on a multi-GPU machine for inference

I am trying to parallelize CUDA initialization across multiple GPUs. I want to make my PyTorch model's initialization faster, and I have noticed that this initialization takes about 3 s per GPU on a P6000.
Simple code like the following
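For reference, this is roughly how I measured the per-GPU cost (a sketch; the first CUDA call on a device is what triggers context creation, so I time that call):

```python
import time
import torch

# Time the first CUDA operation on each device, which creates its context.
for gpu in range(torch.cuda.device_count()):
    start = time.perf_counter()
    torch.Tensor([0]).cuda(device=gpu)  # first call initializes the CUDA context
    print(f"GPU {gpu}: {time.perf_counter() - start:.2f}s")
```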

import torch
import torch.multiprocessing as mp

# One spawned worker per GPU; pool.map calls torch.Tensor([0]).cuda(i) for each device index
pool = mp.get_context('spawn').Pool(torch.cuda.device_count())
pool.map(torch.Tensor([0]).cuda, range(torch.cuda.device_count()))

does not seem to show any performance improvement compared to

# Sequential baseline: initialize each GPU's context one at a time
for gpu in range(torch.cuda.device_count()):
    torch.Tensor([0]).cuda(device=gpu)

It looks like these operations cannot be parallelized. Am I doing something wrong? Is there any way to make this faster?
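For completeness, a thread-based variant of the same experiment (a sketch; I have not verified whether this actually overlaps the initializations, since that depends on the GIL being released during context creation):

```python
import threading
import torch

def init_gpu(gpu):
    # The first CUDA call on this device creates its context.
    torch.Tensor([0]).cuda(device=gpu)

threads = [threading.Thread(target=init_gpu, args=(i,))
           for i in range(torch.cuda.device_count())]
for t in threads:
    t.start()
for t in threads:
    t.join()
```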