I have a tensor whose data_ptr I need aligned to 16 bytes, since I'm passing it to a CUDA extension that uses vectorized loads.
.contiguous() doesn't always work. Is
.clone() guaranteed to work (e.g., will a clone of a 16-element tensor always have its memory aligned to 16 bytes)?
```python
import torch

a = torch.randn(32, dtype=torch.bfloat16, device='cuda')
print(a.data_ptr() % 16)               # 0
b = a[1:17]                            # slice of 16 elements (32 bytes)
print(b.data_ptr() % 16)               # 2 (offset by one bfloat16 = 2 bytes)
print(b.contiguous().data_ptr() % 16)  # 2 (b is already contiguous, so no copy is made)
print(b.clone().data_ptr() % 16)       # 0
```
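For context, my current workaround is a small helper (hypothetical name `ensure_aligned`) that clones only when the pointer is misaligned. It relies on fresh allocations coming back aligned, which is exactly the behavior I haven't found documented as a guarantee. Shown here on CPU tensors so it runs anywhere; the same idea applies on CUDA:

```python
import torch

def ensure_aligned(t: torch.Tensor, alignment: int = 16) -> torch.Tensor:
    # Hypothetical helper: clone only when the data pointer is misaligned.
    # Assumes (not a documented guarantee) that the allocator returns
    # suitably aligned memory for fresh allocations.
    if t.data_ptr() % alignment != 0:
        t = t.clone()
    return t

a = torch.randn(32, dtype=torch.bfloat16)  # CPU for illustration
b = a[1:17]                                # view offset by 2 bytes into a's storage
print(b.data_ptr() % 16)                   # typically 2
print(ensure_aligned(b).data_ptr() % 16)   # 0, if clone allocates aligned memory
```

This avoids copies for tensors that are already aligned, but the open question above remains: whether the aligned-clone behavior is something I can depend on.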