Unexpected results without clone (cpp extension)

I wrote a C++ extension and it compiles and installs successfully.
I have a CUDA kernel like this:

__global__ void g(const float *x, float *y);

And a function that takes torch tensors as arguments like this:

torch::Tensor f(const torch::Tensor &x) {
    auto y = torch::zeros_like(x);
    g<<< 64, 1024, 0, at::cuda::getCurrentCUDAStream() >>> 
        (x.clone().data<float>(), y.data<float>());
    return y;
}

This gives me the expected results. However, when I remove the clone():

torch::Tensor f(const torch::Tensor &x) {
    auto y = torch::zeros_like(x);
    g<<< 64, 1024, 0, at::cuda::getCurrentCUDAStream() >>>
        (x.data<float>(), y.data<float>());
    return y;
}

It gives me different and unexpected results.
I can’t understand why this happens, because x is const in both f and g.
Can someone tell me the reason for this behavior?

You didn’t ask about this, but using x.clone().data<float>() looks very fishy to me in terms of ownership of the memory pointer. RAII (i.e. the memory lives only as long as you hold the object) means you need to assign x.clone() to a local variable (I thought that what you do was a compile error previously, but apparently I am mistaken). This is even more important given that CUDA is asynchronous.
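Purely as an illustration of the ownership point (reusing your kernel g and launch configuration), the clone would be kept in a named local like this, so its storage is still owned when the asynchronous kernel actually runs:

torch::Tensor f(const torch::Tensor &x) {
    auto xcl = x.clone();  // named local: the cloned storage outlives the launch
    auto y = torch::zeros_like(x);
    g<<< 64, 1024, 0, at::cuda::getCurrentCUDAStream() >>>
        (xcl.data<float>(), y.data<float>());
    return y;
}

This only addresses the ownership question, though; the likely cause of your wrong results is the next point.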

What you likely do in your kernel (which you didn’t show, but which is probably where the error lies) is to expect contiguous tensors. The right thing to do is to define a local variable auto xc = x.contiguous() and pass that to the kernel, as sketched below.
If you need elementwise access, passing PackedTensorAccessors to your kernels, obtained from xc.packed_accessor<...>(), is a good way to pass tensors along with their size and stride information.
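A minimal sketch of the contiguous fix, under the same assumptions as your code above (same kernel g, same launch configuration, float input):

torch::Tensor f(const torch::Tensor &x) {
    // returns x itself if it is already contiguous, otherwise a contiguous copy
    auto xc = x.contiguous();
    auto y = torch::zeros_like(xc);
    g<<< 64, 1024, 0, at::cuda::getCurrentCUDAStream() >>>
        (xc.data<float>(), y.data<float>());
    return y;
}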

Best regards

Thomas

Thanks so much.
Making the variable contiguous works.
I am not familiar with PackedTensorAccessors and I can’t find related topics on the Internet. Can you provide an example of how to use them or some related materials?

I should SEO-optimize my blog more (it’s the second item on DuckDuckGo, but not on the first page at Google) :wink:. My blog post on a Sinkhorn kernel uses and comments on them with a full code example.
I also used them for PyTorch’s own batch norm implementation. That and the C++ extension tutorial have the latest refactoring with packed_accessor32 (32-bit indexing is much faster on CUDA than 64-bit).
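Until you get to those, here is a minimal, self-contained sketch of what a packed_accessor32 kernel can look like. The kernel name, the elementwise “times two” operation, and the 2-D float CUDA input are just assumptions for the example, not anything from your code:

#include <torch/extension.h>
#include <ATen/cuda/CUDAContext.h>

__global__ void times_two_kernel(
    const torch::PackedTensorAccessor32<float, 2, torch::RestrictPtrTraits> x,
    torch::PackedTensorAccessor32<float, 2, torch::RestrictPtrTraits> y) {
  // the accessor carries sizes and strides, so indexing is correct
  // even for non-contiguous inputs
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int j = blockIdx.y * blockDim.y + threadIdx.y;
  if (i < x.size(0) && j < x.size(1)) {
    y[i][j] = 2.0f * x[i][j];
  }
}

torch::Tensor times_two(const torch::Tensor &x) {
  // x is assumed to be a 2-D float tensor on the GPU
  auto y = torch::empty_like(x);
  const dim3 threads(16, 16);
  const dim3 blocks((x.size(0) + 15) / 16, (x.size(1) + 15) / 16);
  times_two_kernel<<<blocks, threads, 0, at::cuda::getCurrentCUDAStream()>>>(
      x.packed_accessor32<float, 2, torch::RestrictPtrTraits>(),
      y.packed_accessor32<float, 2, torch::RestrictPtrTraits>());
  return y;
}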

Best regards

Thomas