RuntimeError: add does not support non-contiguous NestedTensor inputs

I am trying to implement a more memory-efficient multi-head attention that uses NestedTensors instead of a key_padding_mask. However, I am running into this error on the backward pass: "RuntimeError: add does not support non-contiguous NestedTensor inputs".

Does anyone know what this means?

import torch

device = torch.device('cuda')

# Fixed-length queries: 3 sequences of length 5, embedding dim 64
q = torch.nested.nested_tensor(
    [torch.randn(5, 64, device=device) for _ in range(3)]
).contiguous()
# Ragged keys/values: 3 sequences of random length in [3, 10], embedding dim 64
kv = torch.nested.nested_tensor(
    [torch.randn(int(torch.randint(3, 11, (1,))), 64, device=device) for _ in range(3)]
).contiguous()
q.requires_grad = True
kv.requires_grad = True

# Lifted from _scaled_dot_product_attention_math
attn = torch.matmul(q, kv.transpose(-2, -1))         # nested (3, 5, ragged) scores
attn = attn.softmax(-1)                              # softmax over the key dimension
attn = torch.nn.functional.dropout(attn, 0.1, True)
out = torch.matmul(attn, kv)                         # nested (3, 5, 64)

# Reduce the ragged outputs to a scalar loss and backprop — this is where it fails
sum(t.mean() for t in out.unbind()).backward()
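For comparison, here is the same chain of operations on ordinary dense tensors (fixed key length 8 instead of ragged lengths, CPU, no NestedTensor). This version backpropagates without error for me, which suggests the problem is specific to the nested layout rather than to the attention math itself:

```python
import torch

# Dense stand-ins for the nested inputs: batch 3, query len 5, key len 8, dim 64
q = torch.randn(3, 5, 64, requires_grad=True)
kv = torch.randn(3, 8, 64, requires_grad=True)

# Same ops as the nested repro above
attn = torch.matmul(q, kv.transpose(-2, -1))   # (3, 5, 8) attention scores
attn = attn.softmax(-1)
attn = torch.nn.functional.dropout(attn, 0.1, True)
out = torch.matmul(attn, kv)                   # (3, 5, 64)

out.mean().backward()                          # backward succeeds on dense tensors
```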