Run inference manually on a PyTorch model layer by layer

Hi all,

My goal is to replicate the forward pass of the model manually, layer by layer, and verify the results.

Setup:

  1. I first run the model end-to-end using PyTorch's forward method and save the intermediate layer inputs, weights, and outputs (for both convolution and fully connected layers) as CSV files in a directory (original).

# hooks to save the intermediate layer inputs/outputs

import torch

def attach_hooks(model):
    for name, layer in model.named_modules():
        if isinstance(layer, (torch.nn.Conv2d, torch.nn.Linear, torch.nn.ReLU, torch.nn.MaxPool2d, torch.nn.AdaptiveAvgPool2d, torch.nn.Dropout)):
            layer.register_forward_hook(lambda layer, inp, out, name=name: save_activation(name, layer, inp, out))

# run inference on the model

model = load_pretrained_vgg11_model()
attach_hooks(model)
output = model(input_tensor)
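
save_activation isn't shown above; a minimal sketch of what such a helper could look like (only the function name and the original directory come from my setup, the one-file-per-tensor CSV layout below is an assumption):

import os

import torch

def save_activation(name, layer, inp, out, out_dir="original"):
    # dump the layer input, weight/bias (if any), and output as CSV files,
    # one file per tensor, named after the layer
    os.makedirs(out_dir, exist_ok=True)
    tensors = {"input": inp[0], "output": out}
    if getattr(layer, "weight", None) is not None:
        tensors["weight"] = layer.weight
    if getattr(layer, "bias", None) is not None:
        tensors["bias"] = layer.bias
    for kind, t in tensors.items():
        t = t.detach().cpu()
        path = os.path.join(out_dir, f"{name}_{kind}.csv")
        with open(path, "w") as f:
            # first line: tensor shape, second line: flattened values;
            # write enough digits that float32 values survive the round trip
            f.write(",".join(str(d) for d in t.shape) + "\n")
            f.write(",".join(f"{v:.9e}" for v in t.reshape(-1).tolist()) + "\n")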

  2. I created a second script that reads these CSV files, performs the same computations manually for each layer (e.g., convolution, ReLU, pooling, linear layers), and saves the outputs of each layer in another directory (layer_by_layer); a sketch of this manual step follows the list below.

  3. I compare the outputs from my manual layer-by-layer calculations with the original outputs that were saved during the initial end-to-end run.
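
The per-layer re-computation in step 2 then boils down to something like the sketch below; it relies on torch.nn.functional rather than hand-rolled loops, and assumes the inputs/weights have already been read back from the CSV files and that the layer hyper-parameters (stride, padding, etc.) are available:

import torch
import torch.nn.functional as F

def recompute_layer(layer, x, weight=None, bias=None):
    # re-apply one layer's computation to the input tensor x, using the
    # weights/biases that were dumped during the original run
    if isinstance(layer, torch.nn.Conv2d):
        return F.conv2d(x, weight, bias, stride=layer.stride, padding=layer.padding,
                        dilation=layer.dilation, groups=layer.groups)
    if isinstance(layer, torch.nn.Linear):
        return F.linear(x, weight, bias)
    if isinstance(layer, torch.nn.ReLU):
        return F.relu(x)
    if isinstance(layer, torch.nn.MaxPool2d):
        return F.max_pool2d(x, kernel_size=layer.kernel_size, stride=layer.stride,
                            padding=layer.padding, dilation=layer.dilation)
    if isinstance(layer, torch.nn.AdaptiveAvgPool2d):
        return F.adaptive_avg_pool2d(x, layer.output_size)
    if isinstance(layer, torch.nn.Dropout):
        return x  # identity during inference
    raise NotImplementedError(type(layer))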

The Problem:

  1. When I compare the output tensors from my manual calculations with the stored output tensors for each layer, there are significant differences, especially in deeper layers.
  2. The final prediction from my layer-by-layer inference is incorrect. Instead of predicting “beagle” (which was the correct prediction from the original inference), the model predicts “chain” after softmax is applied to the output of the last fully connected layer.
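
For context, the prediction itself is read off along these lines (final_fc_output and the imagenet_classes lookup are placeholder names, not code from my scripts):

import torch

# final_fc_output: the output tensor of classifier.6, shape (1, 1000)
probs = torch.softmax(final_fc_output, dim=1)
pred_idx = probs.argmax(dim=1).item()
print(imagenet_classes[pred_idx])  # "beagle" for the original run, "chain" for the manual one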

What I’ve Noticed:

  1. The maximum difference and relative errors between the computed and stored outputs grow progressively larger in deeper layers (how I mean these numbers is sketched after this list). For instance:
    - For features.0, the maximum difference is ~13.91, with a relative error of ~4569017.
    - For features.3, the maximum difference is ~45.15, with a relative error of ~475987.
    - For classifier.6 (the final layer), the maximum difference is ~18.54, with a relative error of ~219.
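
A minimal sketch of this kind of comparison, assuming relative error means the largest elementwise ratio |computed − stored| / |stored| (stored_out and manual_out are placeholder names for tensors read back from the original and layer_by_layer directories):

import torch

def compare(stored, computed, eps=1e-12):
    # maximum absolute difference between the two tensors
    diff = (computed - stored).abs()
    max_diff = diff.max().item()
    # relative error taken as the largest elementwise ratio;
    # near-zero reference values can inflate this number considerably
    rel_err = (diff / (stored.abs() + eps)).max().item()
    return max_diff, rel_err

max_diff, rel_err = compare(stored_out, manual_out)
print(f"max diff: {max_diff:.2f}, relative error: {rel_err:.0f}")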

Questions:

  1. Why are there significant differences in the intermediate outputs of the layers, and how might this lead to an incorrect final prediction?
  2. Are there any differences between PyTorch's end-to-end inference and my manual layer-by-layer implementation that could explain these discrepancies and the wrong final prediction?
  3. Has anyone encountered similar issues, and how can I reduce these differences to improve accuracy in the layer-by-layer manual calculations?

Any guidance on resolving these discrepancies or improving the accuracy of my manual calculations would be greatly appreciated!

Thanks in advance!

  1. You didn’t describe the model architecture, but I see you are registering the forward hook on nn.Dropout layers, so I would start by disabling these (e.g. via model.eval(); see the sketch after this list), as it’s expected to see different results while the model is in training mode unless the code is carefully seeded.
  2. Yes, there could be differences depending on the actual model implementation, in particular its forward method. You should make sure that your layer-by-layer execution matches the original forward pass.
  3. Yes, but the mismatches were expected due to e.g. the usage of dropout layers.
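
Following up on point 1, a minimal sketch of what disabling dropout would look like in the original script (reusing the names from the question):

import torch

model = load_pretrained_vgg11_model()
model.eval()               # switches Dropout (and similar) layers to inference behaviour
attach_hooks(model)
with torch.no_grad():      # gradients aren't needed for this comparison
    output = model(input_tensor)

The layer-by-layer script then has to treat the dropout layers as identity operations and also replicate anything that only happens inside the model's forward method; if this is torchvision's VGG, that includes the torch.flatten call between the adaptive average pooling and the classifier.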