Fastest way to sample from high dimensional random normal

At each iteration of a for loop, I would like to get a batch of samples from a multidimensional random normal distribution in python (with dimensionality>10K). Pytorch’s MultivariateNormal seems to be too slow. What is the most efficient function to use when running the script on CPUs and GPUs? Is there an issue with the current implementation of this function? Would it be more efficient to sample in numpy or tensorflow and cast the samples to pytorch tensors?