Inference non-deterministic on Windows CPU


I have observed non-determinism on the Windows CPU build of torch==1.6.0, related to fully-convolutional models’ ability to “tile”.

By “tiling”, I am referring to the property of translation equivariance, whereby translating the input to a function results in an output that is translated by the same amount. This of course is the basic logic behind convolution, and it can be straightforwardly generalized to models which consist only of convolutional layers (including transpose conv, nearest-neighbor upsampling, and pooling).
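To make the property concrete, here is a minimal sketch using a single unpadded convolution as a hypothetical stand-in for a real fully-convolutional model. The layer sizes, the `shift` value, and the tolerance are illustrative assumptions, not taken from our actual models. It checks that cropping the input by some number of pixels crops the output by the same amount.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a fully-convolutional model: one unpadded 3x3 conv.
conv = nn.Conv2d(in_channels=1, out_channels=2, kernel_size=3, padding=0)
conv.eval()

image = torch.rand(1, 1, 16, 16)  # (batch, channels, height, width)
shift = 4                         # illustrative translation, in pixels

with torch.no_grad():
    full = conv(image)                            # 16x16 input -> 14x14 output
    cropped = conv(image[:, :, shift:, shift:])   # 12x12 input -> 10x10 output

# With stride 1, translating the input by `shift` translates the output by `shift`.
expected = full[:, :, shift:, shift:]
max_abs_diff = (expected - cropped).abs().max().item()

# A small tolerance is used here; bit-exact equality is the stricter check.
equivariant = torch.allclose(expected, cropped, atol=1e-6)
print(equivariant, max_abs_diff)
```

For a model with stride greater than 1, the input shift has to be a multiple of the stride, and the output shift scales accordingly.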

Tiling is important because it allows us to keep memory usage down when running inference on very large (>1 megapixel) images. Instead of running a single inference pass on the entire image, we “tile” it into smaller chunks, then invoke the model on each tile in sequence. This does not result in any seams between the outputs if the model is fully convolutional (although special care must be taken to respect the stride and pad amount of the model).

We have a test to check whether a model is able to tile:

import torch

from my_fully_convolutional_model import build_model

# Build the fully-convolutional model. It has the following attributes: 
# "min_input_size", "min_output_size", and "stride". 
model = build_model(...)

# Set the number of tiles along each axis (for a total of num_tiles**2).
num_tiles = 4

# Sample a random image with the correct dimensions.
input_image = torch.rand(
    model.min_input_size + (num_tiles - 1) * model.stride,
    model.min_input_size + (num_tiles - 1) * model.stride
)

# Run the model in a single pass over the entire image.
output_image = model(input_image).detach()

# Allocate a tensor for the "tiled" output.
tiled_output_image = torch.zeros_like(output_image)

# Invoke the model on each tile separately:
for i in range(num_tiles):
    for j in range(num_tiles):
        # Compute the upper-left indices for the current tile.
        start_h, start_w = i * model.stride, j * model.stride

        # Slice the input image to acquire the tile.
        input_tile = input_image[
            start_h : start_h + model.min_input_size,
            start_w : start_w + model.min_input_size]

        # Assign the output of the model to the corresponding location in the output image.
        tiled_output_image[
            start_h : start_h + model.min_output_size,
            start_w : start_w + model.min_output_size] = model(input_tile)

# Assert that the tiled output image matches the non-tiled output image exactly. 
assert (output_image == tiled_output_image).all().item()
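When the exact comparison fails, it helps to quantify how far apart the two outputs are. Below is a rough sketch of a helper that reports the maximum absolute difference and the PSNR. The name `compare_outputs` and the default `peak=1.0` (which assumes outputs in the range [0, 1]) are my own assumptions for illustration.

```python
import math

import torch


def compare_outputs(reference, candidate, peak=1.0):
    """Return (max absolute difference, PSNR in dB) between two tensors.

    `peak` is the assumed maximum signal value; 1.0 assumes outputs in [0, 1].
    """
    # Compute in float64 so the metric itself does not add rounding noise.
    diff = reference.double() - candidate.double()
    max_abs = diff.abs().max().item()
    mse = diff.pow(2).mean().item()
    # Identical tensors have zero error, which corresponds to infinite PSNR.
    psnr = math.inf if mse == 0.0 else 10.0 * math.log10(peak ** 2 / mse)
    return max_abs, psnr


# Usage: a uniform error of 0.01 on a [0, 1] signal gives a PSNR of about 40 dB.
a = torch.zeros(4, 4, dtype=torch.float64)
b = a + 0.01
max_abs, psnr = compare_outputs(a, b)
print(max_abs, psnr)
```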

This test always passes for fully convolutional models on macOS and Linux CPU builds. On Windows CPU builds only, however, certain models fail: the PSNR between the tiled and non-tiled results is roughly 40 dB. To find the cause, I have tried the following, to no avail:

  • Setting the random seeds in python, numpy, and torch
  • Disabling the GPU device with export CUDA_VISIBLE_DEVICES=-1
  • Disabling cuDNN benchmark mode and enabling cuDNN determinism (although this should have no effect, as we run this test on CPU only)
  • Setting the number of threads to 1 with torch.set_num_threads(1)
  • Downgrading torch from 1.6.0 to 1.4.0
  • Upgrading Python from 3.7 to 3.8
  • Swapping out calls to F.interpolate(..., mode="nearest") with F.pixel_shuffle and nn.ConvTranspose2d
  • Many other things that I am forgetting right now
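For reference, the seed, thread, and cuDNN settings from the list above look roughly like this. This is a sketch rather than our exact test setup; the function name `configure_determinism` and the default seed are hypothetical.

```python
import random

import numpy as np
import torch


def configure_determinism(seed=0):
    """Apply the seed, thread, and cuDNN settings described in the list above."""
    # Seed the Python, NumPy, and torch random number generators.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

    # Restrict intra-op parallelism to a single thread.
    torch.set_num_threads(1)

    # cuDNN flags; these should not matter for a CPU-only run.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True


# Usage: the same seed yields the same random tensor.
configure_determinism(123)
first = torch.rand(3)
configure_determinism(123)
second = torch.rand(3)
print(torch.equal(first, second))
```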

Even stranger, not all Windows machines reproduce this error. For instance, a dual-booted MacBook does not get this error, while two other Windows machines do, but each on different models (all of which pass the test on macOS and Linux).

Has anyone else observed anything like this? I wanted to post here before escalating to a bug report, especially since reproducing the error across systems eludes us right now.


P.S. this test also passes when using ONNX as the inference engine, instead of torch.