I have a tensor whose data_ptr needs to be aligned to 16 bytes, because I'm passing it to a CUDA extension that uses vectorized loads. .contiguous() doesn't always work: if the tensor is already contiguous, it's a no-op that returns the same, possibly misaligned, pointer. Is .clone() guaranteed to work (e.g., will a clone of a tensor of size 16 always have its memory aligned to 16 bytes)? Minimal repro:
import torch

a = torch.randn(32, dtype=torch.bfloat16, device='cuda')
print(a.data_ptr() % 16)               # 0: fresh allocation is aligned
b = a[1:17]                            # 16-element view starting 2 bytes into a's storage
print(b.data_ptr() % 16)               # 2: misaligned
print(b.contiguous().data_ptr() % 16)  # 2: b is already contiguous, so this is a no-op
print(b.clone().data_ptr() % 16)       # 0: clone allocates fresh memory
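
My current workaround is the helper below (ensure_aligned is my own name, not a PyTorch API). It clones only when the pointer is misaligned, under the assumption that a fresh allocation from the CUDA caching allocator is aligned well beyond 16 bytes; that seems to hold in my tests, but I haven't found it documented as a guarantee, which is why I'm asking.

def ensure_aligned(t: torch.Tensor, alignment: int = 16) -> torch.Tensor:
    # Return t unchanged if its storage pointer is already aligned,
    # otherwise return a contiguous copy backed by a fresh allocation.
    # Assumption (unverified): fresh CUDA caching-allocator allocations
    # are at least `alignment`-byte aligned.
    t = t.contiguous()  # no-op if already contiguous; may still be misaligned
    if t.data_ptr() % alignment != 0:
        t = t.clone()  # fresh allocation; empirically aligned
    return t

print(ensure_aligned(b).data_ptr() % 16)  # 0 in my tests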