How to ensure that a tensor's data_ptr is aligned to 16 bytes

I have a tensor whose data_ptr needs to be aligned to 16 bytes because I’m passing it to a CUDA extension that uses vectorized loads.
.contiguous() doesn’t always work. Is .clone() guaranteed to work (e.g., will a clone of a 16-element tensor always have its memory aligned to 16 bytes)?

import torch
a = torch.randn(32, dtype=torch.bfloat16, device='cuda')
print(a.data_ptr() % 16)  # 0
b = a[1:17]  # 16-element bfloat16 slice, starts 2 bytes into a's storage
print(b.data_ptr() % 16)  # 2
print(b.contiguous().data_ptr() % 16)  # 2 -- b is already contiguous, so this returns b itself
print(b.clone().data_ptr() % 16)  # 0 -- clone copies into a fresh allocation
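
For context, here is a minimal sketch of the check-and-copy workaround I have in mind. The helper name ensure_aligned is just for illustration; it only copies when the pointer is misaligned, and it assumes that a fresh allocation made by .clone() is at least 16-byte aligned, which is exactly the guarantee I'd like to confirm.

import torch

def ensure_aligned(t: torch.Tensor, alignment: int = 16) -> torch.Tensor:
    # Return t unchanged if it is contiguous and its data_ptr is already aligned;
    # otherwise copy it into a fresh contiguous allocation.
    # Assumption: a fresh allocation from clone() is at least `alignment`-byte aligned.
    if t.is_contiguous() and t.data_ptr() % alignment == 0:
        return t
    return t.clone(memory_format=torch.contiguous_format)

a = torch.randn(32, dtype=torch.bfloat16, device='cuda')
b = a[1:17]               # misaligned view (2-byte offset into a's storage)
c = ensure_aligned(b)
print(c.data_ptr() % 16)  # 0 on my machine, but is this guaranteed?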