You didn’t ask about this, but using `x.clone().data<float>()`
looks very fishy to me in terms of ownership of the memory pointer. RAII (i.e. the memory lives only as long as you hold the object) means you need to assign `x.clone()`
to a local variable (I thought that what you do was a compile error previously, but apparently I am mistaken). This is even more important given that CUDA is asynchronous: the temporary returned by `x.clone()` can be freed before the kernel ever reads through the pointer.
What you likely do in your kernel (which you didn’t show, but which is likely where the error lies) is to expect contiguous tensors. The right thing to do is to define a local variable `auto xc = x.contiguous();` and take the data pointer from that.
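A minimal sketch of the pattern (the kernel name and launch configuration are made-up placeholders, since you didn’t show your kernel):

```cpp
// Keep the tensor in a local variable so the storage outlives the
// (asynchronous) kernel launch, and make it contiguous first.
auto xc = x.clone().contiguous();
float* ptr = xc.data<float>();  // valid for as long as xc is in scope

// Hypothetical launch; my_kernel, blocks and threads are illustrative.
my_kernel<<<blocks, threads>>>(ptr, xc.numel());
// xc must stay alive until the kernel has actually finished,
// e.g. past the next synchronization point.
```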
If you need elementwise access, passing `PackedTensorAccessor`s to your kernels, obtained from `xc.packed_accessor<...>()`,
is a good way to pass tensors along with their size and stride information.
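For example (a sketch only; the kernel name is made up and I’m assuming a 2-d float tensor):

```cpp
// Device side: the accessor bundles pointer, sizes and strides,
// so indexing works without passing them separately.
__global__ void my_kernel(at::PackedTensorAccessor<float, 2> acc) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < acc.size(0)) {
    for (int j = 0; j < acc.size(1); j++) {
      acc[i][j] *= 2.0f;  // elementwise access, strides handled for you
    }
  }
}

// Host side: build the accessor from the contiguous local tensor.
auto xc = x.contiguous();
my_kernel<<<blocks, threads>>>(xc.packed_accessor<float, 2>());
```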
Best regards
Thomas