How is permutation implemented in PyTorch cuda

Fei_Wang1 · March 5, 2019, 5:32pm

Hi, I am interested in finding out how PyTorch implements permutations on CUDA. However, I cannot find the code in the repo.

I assumed that there should be a GPU kernel function for permutation in PyTorch/aten/src/ATen/native/cuda/, but I didn’t find it in either TensorTransformation.cu or TensorFactories.cu.

Could someone point me to the implementation? Thanks.

Fei

LeviViana · March 5, 2019, 11:49pm

Take a look here:

pytorch/pytorch/blob/4404762d7dd955383acee92e6f06b48144a0742e/aten/src/ATen/native/cuda/TensorFactories.cu#L74-L111


Tensor& randperm_out_cuda(Tensor& result, int64_t n, Generator* generator) {
  AT_CHECK(n >= 0, "n must be non-negative, got", n);
  AT_CHECK(at::scalar_tensor(n, result.options()).defined(),
           "n is too large for result tensor type: '", result.type().toString(), "'");

  result.resize_({n});

  if (result.type().scalarType() == at::ScalarType::Half) {
    auto result_float = at::empty({n}, initialTensorOptions().device(Device(DeviceType::CUDA)));
    result.copy_(randperm_out_cuda(result_float, n, generator));
  } else {
    if (n < 30000) {  // For small inputs, we offload it to CPU instead.
      auto result_cpu = at::empty({n}, result.options().device(kCPU));
      randperm_out(result_cpu, n, generator);
      result.copy_(result_cpu);
    } else {
      // Generate random values for the keys array
      AT_DISPATCH_ALL_TYPES(
        result.type(), "randperm_out_cuda", [&] {
          auto keys = at::empty(result.sizes(), result.options()).random_(generator);

(The excerpt is truncated here.)

Fei_Wang1 · March 6, 2019, 4:21pm

Hi LeviViana,

Thanks for getting back to me so quickly. However, I think I didn’t express my question clearly. I was hoping to find the implementation of dimension permutation, as in:

a = a.permute(2, 0, 1),

where a is a 3D PyTorch tensor, and the code permutes a’s dimensions so that the innermost dimension (2) becomes the outermost (0).

This is equivalent to code like this:

a = a.transpose(1, 2).transpose(0, 1).contiguous()

The code you pointed out seems to be for random permutations, not permutations of dimensions, if I understood it correctly.

Fei
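
For readers following along, here is a minimal sketch of why the two forms above reorder the axes the same way. It only tracks the order of the axes in plain Python; `transpose` and `permute` here are hypothetical helper functions written for illustration, not PyTorch's own code.

```python
def transpose(order, d0, d1):
    """Return a new axis order with positions d0 and d1 swapped."""
    order = list(order)
    order[d0], order[d1] = order[d1], order[d0]
    return order


def permute(order, dims):
    """Return a new axis order where new axis i is old axis dims[i]."""
    return [order[d] for d in dims]


original = [0, 1, 2]

# a.permute(2, 0, 1)
via_permute = permute(original, (2, 0, 1))

# a.transpose(1, 2).transpose(0, 1)
via_transposes = transpose(transpose(original, 1, 2), 0, 1)

print(via_permute, via_transposes)
```

Both paths end with the axis order 2, 0, 1. The trailing .contiguous() in the snippet above is a separate step: it copies the data into a packed layout, whereas permute on its own returns a view.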

JuanFMontesinos · March 6, 2019, 4:42pm

permute calls transpose.

Fei_Wang1 · March 6, 2019, 5:06pm

Hi JuanFMontesinos,

Thanks for making that clear.

But how does PyTorch implement transpose on GPU?

I didn’t find the source code in
pytorch/aten/src/ATen/native/cuda/
pytorch/aten/src/ATen/native/cudnn/
It would be great if you could point me to the correct file.

Fei

InnovArul · March 6, 2019, 5:09pm

torch.permute() is carried out just by reordering the sizes and strides of the dimensions (similar to NumPy); the underlying data is not moved.

To quote from the IPython Cookbook, section 4.5, "Understanding the internals of NumPy to avoid unnecessary array copying":

When reshaping an array, NumPy avoids copies when possible by modifying the strides attribute. For example, when transposing a matrix, the order of strides is reversed, but the underlying data remains identical

Here you can look at the code:

pytorch/pytorch/blob/a6170573c898a1367517d8daf8e777abaf96f752/aten/src/ATen/native/TensorShape.cpp#L367-L385


      
Tensor permute(const Tensor& self, IntArrayRef dims) {
  auto nDims = self.dim();
  AT_CHECK(dims.size() == (size_t)nDims,
           "number of dims don't match in permute");
  auto oldSizes = self.sizes();
  auto oldStrides = self.strides();
  std::vector<int64_t> newSizes(nDims);
  std::vector<int64_t> newStrides(nDims);
  std::vector<bool> seen(nDims);
  for (int64_t i = 0; i < nDims; i++) {
    auto dim = maybe_wrap_dim(dims[i], nDims);
    AT_CHECK(!seen[dim],
             "repeated dim in permute");
    seen[dim] = true;
    newSizes[i] = oldSizes[dim];
    newStrides[i] = oldStrides[dim];
  }
  return self.as_strided(newSizes, newStrides);
}
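
To make the stride trick concrete, here is a rough pure-Python sketch that mirrors the logic of the C++ function above on a flat list. The names `contiguous_strides`, `permute_view` and `element` are hypothetical helpers for illustration, and the flat list stands in for the tensor's storage; this is not how PyTorch itself is written.

```python
from itertools import product


def contiguous_strides(sizes):
    """Strides of a packed layout where the last dimension has stride 1."""
    strides = [1] * len(sizes)
    for i in range(len(sizes) - 2, -1, -1):
        strides[i] = strides[i + 1] * sizes[i + 1]
    return strides


def permute_view(sizes, strides, dims):
    """Reorder sizes and strides; the storage itself is never touched."""
    n = len(sizes)
    if len(dims) != n:
        raise ValueError("number of dims don't match in permute")
    seen = [False] * n
    new_sizes, new_strides = [], []
    for d in dims:
        if not -n <= d < n:
            raise IndexError("dimension out of range")
        if d < 0:
            d += n  # wrap negative dims
        if seen[d]:
            raise ValueError("repeated dim in permute")
        seen[d] = True
        new_sizes.append(sizes[d])
        new_strides.append(strides[d])
    return new_sizes, new_strides


def element(storage, strides, index):
    """Read one element through a strided view of a flat storage."""
    return storage[sum(i * s for i, s in zip(index, strides))]


storage = list(range(24))            # flat buffer shared by both views
sizes = [2, 3, 4]
strides = contiguous_strides(sizes)  # [12, 4, 1]

p_sizes, p_strides = permute_view(sizes, strides, (2, 0, 1))
print(p_sizes, p_strides)            # [4, 2, 3] [1, 12, 4]

# Element (k, i, j) of the permuted view is element (i, j, k) of the original.
same = all(
    element(storage, p_strides, (k, i, j)) == element(storage, strides, (i, j, k))
    for i, j, k in product(range(2), range(3), range(4))
)
print(same)                          # True
```

Note that the permuted view's last dimension has stride 4 rather than 1, so the view is no longer packed even though no data moved.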

Fei_Wang1 · March 8, 2019, 3:21pm

Hi InnovArul,

Thanks for addressing it. That is indeed an interesting insight.

However, what if the operation following the permutation requires the data to be packed in a certain way? For instance, in the DeepSpeech model, a permutation is used right before the RNN layers. If the data before the permutation is column-major, and the RNN input has to be column-major (as required by cudnnRNNForwardTraining), then we cannot just change the strides for the permutation (the data is no longer column-major if the stride of the last dimension is not 1).

Is this the case where some sort of data copying has to happen?

P.S. Proof that cudnnRNNForwardTraining requires the input to be column-major, from:

https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#cudnnRNNForwardTraining

Input vectors are expected to be arranged in the column-major order so strides in xDesc should be set as follows: strideA[0]=inputSize, strideA[1]=1, strideA[2]=1.

========= update ================
I realized that in cases like this, .contiguous() is called to copy the data into column-major form. I am now looking at how PyTorch implements copy(). No outstanding questions for now.

Thanks,

Fei
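
As a conceptual illustration of the update above, here is a small pure-Python sketch of what a contiguous-style copy does to a permuted view: it walks the view in logical order and writes the elements into a fresh buffer whose last dimension has stride 1. `make_contiguous` and `contiguous_strides` are hypothetical helpers, and this says nothing about how PyTorch's actual copy kernels are implemented on the GPU.

```python
from itertools import product


def contiguous_strides(sizes):
    """Strides of a packed layout where the last dimension has stride 1."""
    strides = [1] * len(sizes)
    for i in range(len(sizes) - 2, -1, -1):
        strides[i] = strides[i + 1] * sizes[i + 1]
    return strides


def make_contiguous(storage, sizes, strides):
    """Gather a strided view into a new packed buffer (this is a real copy)."""
    new_storage = [
        storage[sum(i * s for i, s in zip(index, strides))]
        for index in product(*(range(n) for n in sizes))
    ]
    return new_storage, contiguous_strides(sizes)


storage = list(range(24))

# Sizes and strides of a [2, 3, 4] packed tensor after permute(2, 0, 1).
view_sizes, view_strides = [4, 2, 3], [1, 12, 4]

packed, packed_strides = make_contiguous(storage, view_sizes, view_strides)
print(packed_strides)  # [6, 3, 1]
print(packed[:6])      # [0, 4, 8, 12, 16, 20]
```

After the copy the last dimension has stride 1 again, which is the kind of packed layout the quoted cuDNN requirement describes.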