Traced/Scripted models do not produce same output as eager models when weights are loaded

JIT traced/scripted models are expected to produce the same output as eager models when given the same input.
This seems to hold when the weights are randomly initialized, but not when pre-trained weights are loaded into the models.
Below is a code snippet to reproduce the issue:

Without loading weights:

```python
import torch
from torchvision.models import resnet18, ResNet18_Weights

# Create eager, scripted, and traced models
eager_model = resnet18().cuda(0)
eager_model.eval()
script_model = torch.jit.script(eager_model).cuda()
script_model.eval()
trace_input = torch.randn(4, 3, 224, 224).cuda()
traced_model = torch.jit.trace(eager_model, trace_input).cuda()
traced_model.eval()

# Random input and feature extraction
x = torch.randn(16, 3, 224, 224).cuda()

with torch.no_grad():
    eager_out = eager_model(x)
    script_out = script_model(x)
    traced_out = traced_model(x)

    # Print the summed absolute difference only if allclose (default tolerances) fails
    if not torch.allclose(eager_out, script_out):
        print(f'Scripted: {(eager_out - script_out).abs().sum()}')
    if not torch.allclose(eager_out, traced_out):
        print(f'Traced: {(eager_out - traced_out).abs().sum()}')
```

There is no output for the code above.

Using pre-trained weights

Now consider the code below:

```python
import torch
from torchvision.models import resnet18, ResNet18_Weights

# Create eager, scripted, traced models
eager_model = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1).cuda(0)
eager_model.eval()
script_model = torch.jit.script(eager_model).cuda()
script_model.eval()
trace_input = torch.randn(4, 3, 224, 224).cuda()
traced_model = torch.jit.trace(eager_model, trace_input).cuda()
traced_model.eval()

# Random input for feature extraction
x = torch.randn(16, 3, 224, 224).cuda()

with torch.no_grad():
    eager_out = eager_model(x)
    script_out = script_model(x)
    traced_out = traced_model(x)

    # Print the summed absolute difference only if allclose (default tolerances) fails
    if not torch.allclose(eager_out, script_out):
        print(f'Scripted: {(eager_out - script_out).abs().sum()}')
    if not torch.allclose(eager_out, traced_out):
        print(f'Traced: {(eager_out - traced_out).abs().sum()}')
```

The output of the code is as follows:

```
Scripted: 4.249152183532715
Traced: 4.250102996826172
```

Any ideas what’s going on? Is this expected behavior? A summed absolute difference of ~4.25 is small given that the output is a 16 x 1000 tensor, but it is not insignificant. For classification this shouldn’t matter much, but I suspect it could cause problems in cases where precision is critical.
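A summed absolute error grows with the number of elements: 4.25 spread over 16 x 1000 = 16,000 logits is on average about 2.7e-4 per logit. A small helper that also reports the maximum and a relative error can make such numbers easier to interpret. This is a sketch; `report_diff` is a hypothetical helper name, and the tolerances simply mirror the defaults of `torch.allclose`.

```python
import torch


def report_diff(reference: torch.Tensor, candidate: torch.Tensor) -> dict:
    """Summarize how far `candidate` is from `reference` (hypothetical helper)."""
    diff = (reference - candidate).abs()
    # Normalize by the largest reference magnitude; guard against division by zero.
    scale = reference.abs().max().clamp_min(torch.finfo(reference.dtype).tiny)
    return {
        "sum_abs": diff.sum().item(),    # what the snippets above print
        "max_abs": diff.max().item(),    # worst single element
        "mean_abs": diff.mean().item(),  # sum_abs / number of elements
        "max_rel": (diff.max() / scale).item(),
        "allclose": torch.allclose(reference, candidate, rtol=1e-5, atol=1e-8),
    }
```

With the tensors from the snippets above, `report_diff(eager_out, script_out)` could be printed inside the `torch.no_grad()` block.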

BTW I am using torch==1.13.1 and torchvision==0.14.1

I cannot reproduce the issue deterministically on the first iteration, but after a few warmup iterations the abs().max() error increases. I assume this is caused by model optimizations such as layer fusions, which do not produce bitwise-identical outputs. I also don’t see a difference between the random and the pre-trained model in torch==2.0.0.
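To see the warm-up effect described above, one can record the maximum error per call instead of doing a single comparison. A minimal sketch, assuming `eager_model`, `script_model`/`traced_model` and `x` from the snippets above; the function name is hypothetical and the number of iterations is arbitrary.

```python
import torch
import torch.nn as nn


def per_iteration_max_error(eager: nn.Module, compiled, x: torch.Tensor, iters: int = 5) -> list:
    """Run the compiled model `iters` times and record max |eager - compiled| per call."""
    errors = []
    with torch.no_grad():
        reference = eager(x)  # eager output computed once as the baseline
        for _ in range(iters):
            out = compiled(x)
            errors.append((reference - out).abs().max().item())
    return errors
```

For example, `print(per_iteration_max_error(eager_model, traced_model, x))` would show whether the error changes after the first few calls.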

I also noticed something similar during my testing. Even when using torch.ones(4, 3, 224, 224).cuda() as the input, in many cases the first iteration returned identical outputs for all models, while subsequent iterations produced a certain amount of error. Again, I’m not sure if this is expected behavior. Does the compiled model still undergo optimizations (e.g., the layer fusions you mentioned) after compilation?

Yes, small errors caused by limited numerical precision are expected, since different algorithms can be used, especially if layers/operations are fused.
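As a tiny illustration of why a different algorithm can change the last bits: floating-point addition is not associative, so a fused kernel or another cuDNN algorithm that accumulates the same products in a different order can round differently. Plain Python floats are float64; the models above run in float32, whose machine epsilon (about 1.19e-7) is much larger, so such discrepancies tend to be larger as well.

```python
# Regrouping the same three additions changes the rounded float64 result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```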

You are not using the newly added torch.compile approach, which (if I’m not mistaken) would optimize the model directly; instead, you are scripting/tracing the model. That approach uses TorchScript, which uses the first (3?) iterations to optimize the model.
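One way to check whether these optimization passes are responsible is to run the scripted model with the graph executor's optimizations turned off via the `torch.jit.optimized_execution` context manager and compare against the default. This is a sketch meant as a diagnostic, not a fix; the helper name is hypothetical, and it assumes the same `eager_model`, `script_model` and `x` as above.

```python
import torch
import torch.nn as nn


def errors_with_and_without_optimization(eager: nn.Module, scripted, x: torch.Tensor, iters: int = 3) -> dict:
    """Max |eager - scripted| per call, first with JIT optimizations disabled, then enabled."""
    results = {}
    with torch.no_grad():
        reference = eager(x)
        for optimize in (False, True):
            # optimized_execution(False) tells the graph executor to skip its optimization passes.
            with torch.jit.optimized_execution(optimize):
                results[optimize] = [
                    (reference - scripted(x)).abs().max().item() for _ in range(iters)
                ]
    return results
```

If the error only appears in the `True` entries, the JIT's optimizations are a likely source; if it also appears with optimizations disabled, the difference probably comes from elsewhere.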

I see. It just seems a bit odd that numerical precision issues would show up when weights are loaded, but not when the models are randomly initialized.
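One possible factor, not verified for ResNet18 here: rounding error scales with the magnitude of the values involved, because the spacing between adjacent floating-point numbers grows with their size. If the pre-trained weights produce larger activations and logits than the random initialization, the same relative error would show up as a larger absolute difference such as the summed 4.25. The spacing can be inspected with `math.ulp` (Python 3.9+):

```python
import math

# Distance from x to the next larger representable float64 value.
print(math.ulp(1.0))                             # 2.220446049250313e-16
print(math.ulp(1024.0))                          # 2.2737367544323206e-13
print(math.ulp(1024.0) == 1024 * math.ulp(1.0))  # True
```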

I am actually looking into torch.compile in version 2.0. However, judging from the official blog post (the “Serialization” section), it seems the compiled model can’t be “exported” like traced/scripted models. The main reason I want to use traced/scripted models is so that I can train and export a model from one codebase and load it from another without having to copy the model code and call load_state_dict(). Please let me know if I’ve misunderstood this :smile:
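For the export workflow described above, TorchScript modules can be written with `torch.jit.save` and read back with `torch.jit.load`, and the loading side does not need the model's Python class. A minimal sketch; the small `nn.Sequential` stands in for the real model, and the function names and file name are placeholders.

```python
import os
import tempfile

import torch
import torch.nn as nn


def export_model(model: nn.Module, path: str) -> None:
    """Codebase A: switch to eval mode, script the model and serialize it."""
    model.eval()
    torch.jit.save(torch.jit.script(model), path)


def load_exported(path: str) -> torch.jit.ScriptModule:
    """Codebase B: only needs torch; the model class is never imported."""
    return torch.jit.load(path, map_location="cpu")


if __name__ == "__main__":
    model = nn.Sequential(nn.Linear(8, 4), nn.ReLU())
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "model.pt")
        export_model(model, path)
        loaded = load_exported(path)
        x = torch.randn(2, 8)
        with torch.no_grad():
            print(torch.allclose(model(x), loaded(x)))
```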

I have the same issue with PyTorch 1.12.1 and CUDA 11.3. The traced version of my model with randomly initialized weights produces outputs closely matching the Python model. However, when I load the weights from a checkpoint, the output of the traced model differs significantly (>2.0) from that of the Python model!
@ptrblck Any idea how this problem can be avoided?
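A few things that may help narrow this down, offered as diagnostics rather than a guaranteed fix: disable TF32 (cuDNN convolutions may use it by default on Ampere and newer GPUs) and request deterministic cuDNN algorithms, then measure how much the eager model differs from itself across runs before attributing the gap to the JIT. It is also worth ruling out that the model was still in training mode when it was traced: `torch.jit.trace` records the code path that actually ran, so batch norm and dropout in training mode stay baked into the graph even if `.eval()` is called on the traced module afterwards. The helper names below are hypothetical.

```python
import torch
import torch.nn as nn


def configure_for_reproducibility() -> None:
    # Disable TF32 for cuBLAS matmuls and cuDNN convolutions.
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False
    # Request deterministic cuDNN algorithms and disable autotuning.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


def eager_self_consistency(model: nn.Module, x: torch.Tensor, runs: int = 3) -> float:
    """Largest max |diff| between repeated eager forward passes on the same input."""
    model.eval()  # put the model in eval mode before any comparison (and before tracing)
    with torch.no_grad():
        reference = model(x)
        return max((reference - model(x)).abs().max().item() for _ in range(runs))
```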