Unexpected Behavior in Inference Transforms of Pretrained Models

Hi, I have encountered unexpected behavior in the inference transforms of a pretrained model. The issue shows up when I load images from a dataset with the transform applied versus without it.

When the images are loaded with the transform applied, everything works as intended and the model's accuracy is reasonably high. However, when I load images without the transform and apply it afterwards (after converting the image to a tensor with to_tensor), the resulting images differ from the ones loaded with the transform.

I have tried toggling the antialias attribute of the transform, but it did not lead to any improvement.

Here is the code:

# torchvision
import torchvision.transforms as transforms
from torchvision import models
from torchvision.datasets import ImageNet

# torch
import torch
import torch.nn.functional as F

weights = models.ResNet50_Weights.IMAGENET1K_V2
model = models.resnet50(weights=weights).eval().to('cuda')
transform = weights.transforms()

# Setting `antialias` to either `True` or `False` didn't fix things.
# transform.antialias = ...

# Dataset with transform.
dataset = ImageNet('./data/', split='val', transform=transform)
img_1, _ = dataset[111]

# Dataset without transform.
dataset = ImageNet('./data/', split='val')
img_2, _ = dataset[111]

# 1. Convert PIL image to tensor.
# 2. Transform it.
img_2 = transforms.functional.to_tensor(img_2)
img_2 = transform(img_2)

# Expected the two tensors to match, but they don't.
print(torch.equal(img_1, img_2))

Is there anything I’m missing here? I would appreciate any insights or suggestions to resolve this issue.

Upon reviewing the documentation (Resize — Torchvision main documentation), I discovered that the issue arises from the differences in how PyTorch and PIL handle downsampling.

The warning says:
The output image might be different depending on its type: when downsampling, the interpolation of PIL images and tensors is slightly different, because PIL applies antialiasing. This may lead to significant differences in the performance of a network. Therefore, it is preferable to train and serve a model with the same input types. See also below the antialias parameter, which can help making the output of PIL images and tensors closer.
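In case it helps anyone else, here is a minimal sketch of the workaround that follows from that warning: apply the weights transform to the PIL image directly (which is what ImageNet does internally when you pass transform to the dataset) instead of converting it to a tensor first, so both paths go through the same PIL resize. The dataset path './data/' and sample index 111 are just the placeholders from the post above.

# Minimal sketch: both paths feed the PIL image to the transform.
import torch
from torchvision import models
from torchvision.datasets import ImageNet

weights = models.ResNet50_Weights.IMAGENET1K_V2
transform = weights.transforms()

# Path 1: let the dataset apply the transform (it operates on the PIL image).
img_a, _ = ImageNet('./data/', split='val', transform=transform)[111]

# Path 2: load the raw PIL image and apply the same transform to it,
# without the intermediate to_tensor() call.
pil_img, _ = ImageNet('./data/', split='val')[111]
img_b = transform(pil_img)

# Both paths now resize the PIL image, so the outputs should match.
print(torch.equal(img_a, img_b))

If you really do need to run the transform on tensors (e.g. frames decoded straight to tensors at serving time), the warning above suggests keeping the input type consistent between training and inference and using the antialias parameter to bring the tensor and PIL outputs closer; they may still not be bit-identical.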