Is there any way I can sample random numbers directly on the GPU, instead of sampling on the CPU and then transferring the result, such as:

torch.randn(10, 10).cuda()  # REALLY SLOW

Maybe the torch.distributions package could get a way to specify that sampling is performed on the GPU, because the source code of these methods always uses CPU tensors (torch.Tensor) rather than CUDA tensors.

You can use in-place methods such as uniform_() or normal_() on CUDA tensors to generate random numbers directly on the GPU.
Keep in mind that torch.rand(10, 10) is equivalent to torch.FloatTensor(10, 10).uniform_() and torch.randn(10, 10) is equivalent to torch.FloatTensor(10, 10).normal_().
So torch.cuda.FloatTensor(10, 10).normal_() will do what you want.
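
For reference, here is a minimal sketch of the same idea. It assumes a PyTorch release (0.4 or later) where factory functions accept a device argument, and it falls back to the CPU when no GPU is present; the variable names are only illustrative.

```python
import torch

# Fall back to the CPU so the sketch still runs on machines without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Allocate on the target device, then fill in place with N(0, 1) samples.
a = torch.empty(10, 10, device=device).normal_()

# Same pattern for samples from the uniform distribution on [0, 1).
b = torch.empty(10, 10, device=device).uniform_()

# Factory functions can also create the samples on the device directly,
# so there is no separate host-to-device copy in user code.
c = torch.randn(10, 10, device=device)
```

In all three cases the tensor is created on the chosen device, so there is no .cuda() transfer as in the original snippet.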

I actually think the names should be more consistent. In this case:

randn(*sizes) -> samples a tensor of the given shape from a Gaussian distribution with mean 0 and standard deviation 1

normal(mu_vec, std_vec) -> returns a tensor of random numbers drawn from separate normal distributions whose means and standard deviations are given (taken directly from your docs).

So nothing in the names suggests that normal_ is the in-place Tensor method that does what randn does. Maybe add randn_ as the in-place counterpart of randn, and keep normal_ as the in-place counterpart of normal. Each pair would then have one in-place method (GPU or CPU) and one function returning a CPU tensor.
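
To make the naming question concrete, here is a small sketch contrasting the three calls as I understand them. It assumes a reasonably recent PyTorch, and the shapes and parameter values are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)

# randn: takes only a shape and always samples from N(0, 1).
x = torch.randn(2, 3)

# normal_: in-place fill of an existing tensor; mean and std are scalars
# (defaulting to 0 and 1), so with the defaults it matches randn.
y = torch.empty(2, 3).normal_(mean=5.0, std=0.1)

# torch.normal: one mean and one std per element, returns a new tensor.
means = torch.tensor([0.0, 10.0, 100.0])
stds = torch.tensor([1.0, 0.5, 0.1])
z = torch.normal(means, stds)
```

So normal_ with its default arguments behaves like an in-place randn, while torch.normal is the only one of the three that takes per-element parameters.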