This is expected behavior in the transforms.v2 API according to the docs:
> If there is no `Image` or `Video` instance, only the first pure `torch.Tensor` will be transformed as image or video, while all others will be passed-through. Here "first" means "first in a depth-wise traversal".
You could use the tv_tensors classes instead:
```python
import torchvision

# Wrapping the tensors tells the v2 transforms how to treat each input
img_tv = torchvision.tv_tensors.Image(img)
mask_tv = torchvision.tv_tensors.Mask(mask)
out1, out2 = transforms(img_tv, mask_tv)
```