How does torch.randn generate variates?

ph14 · August 8, 2021, 7:26pm

How does the function torch.randn generate variates? What specific method does it use, and where can I find a reference (if there is one)? Also, does the method change based on CPU/GPU usage?

KFrank · August 9, 2021, 4:41am

Hi Piers!

As near as I can tell, randn() uses the Box-Muller method, both on
the cpu and gpu.

I am not aware that the algorithm used is documented anywhere
but in the code. It’s also quite opaque to me how any given pytorch
python call gets dispatched down to the code that actually does the
work.

But my best guess for the cpu is:

https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cpu/DistributionTemplates.h#L84

                return random(generator);
              });
            });
          }
          
          template<typename RNG>
          struct RandomKernel {
            void operator()(TensorIteratorBase& iter, c10::optional<Generator> gen) {
              random_kernel(iter, check_generator<RNG>(gen));
            }
          };
          
          // ==================================================== Normal ========================================================
          
          #ifdef CPU_CAPABILITY_AVX2
          static void normal_fill_16_AVX2(float *data,
                                   const __m256* two_pi,
                                   const __m256* one,
                                   const __m256* minus_two,
                                   const __m256* mean,
                                   const __m256* std_v) {

and for the gpu is:

https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cuda/DistributionTemplates.h#L414

              distribution_nullary_kernel<scalar_t, accscalar_t, curand4_engine_calls/2>(iter,
                gen,
                [] __device__ (curandStatePhilox4_32_10_t* state) { return curand_uniform2_double(state); },
                transform);
            } else {
              distribution_nullary_kernel<scalar_t, accscalar_t, curand4_engine_calls>(iter,
                gen,
                [] __device__ (curandStatePhilox4_32_10_t* state) { return curand_uniform4(state); },
                transform);
            }
          }
          
          template<typename scalar_t, typename accscalar_t, size_t curand4_engine_calls, typename RNG, typename transform_t>
          void normal_and_transform(TensorIteratorBase& iter, RNG gen, transform_t transform) {
            if (std::is_same<scalar_t, double>::value) {
              distribution_nullary_kernel<scalar_t, accscalar_t, curand4_engine_calls/2>(iter,
                gen,
                [] __device__ (curandStatePhilox4_32_10_t* state) { return curand_normal2_double(state); },
                transform);
            } else {
              distribution_nullary_kernel<scalar_t, accscalar_t, curand4_engine_calls>(iter,

The gpu code calls down into the curand_normal() set of functions
in NVIDIA’s cuRAND library.

Quoting from the cuRAND documentation:

__device__ float2
curand_normal2 (curandState_t *state)
...

The above functions generate two normally or log normally distributed pseudorandom results with each call. Because the underlying implementation uses the Box-Muller transform, this is generally more efficient than generating a single result with each call.
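
To make the "two results with each call" point concrete, here is a minimal sketch of the textbook Box-Muller transform in plain Python. It is not the PyTorch or cuRAND code; the names box_muller and randn_sketch are made up for illustration, and Python's random module stands in for whatever uniform generator the real kernels use.

```python
import math
import random


def box_muller(u1, u2, mean=0.0, std=1.0):
    """Map two uniforms to two independent normal variates.

    u1 must lie in (0, 1] so that log(u1) is finite; u2 lies in [0, 1).
    """
    radius = math.sqrt(-2.0 * math.log(u1))  # radial part
    theta = 2.0 * math.pi * u2               # angular part
    z0 = radius * math.cos(theta)
    z1 = radius * math.sin(theta)
    return mean + std * z0, mean + std * z1


def randn_sketch(n, rng=None):
    """Return n standard-normal variates (hypothetical helper, not torch.randn)."""
    rng = rng or random.Random()
    out = []
    while len(out) < n:
        u1 = 1.0 - rng.random()  # shift [0, 1) to (0, 1] to avoid log(0)
        u2 = rng.random()
        out.extend(box_muller(u1, u2))  # each pair of uniforms yields two normals
    return out[:n]


if __name__ == "__main__":
    samples = randn_sketch(100_000, random.Random(0))
    m = sum(samples) / len(samples)
    v = sum((x - m) ** 2 for x in samples) / len(samples)
    print(f"mean={m:.3f} var={v:.3f}")
```

Every pair of uniforms costs one log, one sqrt, and one sin/cos evaluation and gives back two normals, which is why the paired functions are the cheaper way to call it.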

As described above, the cpu and gpu versions are implemented
entirely independently. It appears, however, that they both end up
using the same underlying Box-Muller algorithm (as near as I can tell).

Best.

K. Frank