Different results, and sometimes all NaNs

I noticed that doing inference with my custom model (which was always trained and evaluated on CUDA) produces slightly different results on CPU. (The percentage of instances assigned a different class is small, but still of interest.)

However, I then tried it on MPS and was blown away. Not only are the results different from both CUDA and CPU, but some of the examples produce all NaNs when running on MPS. Furthermore, it isn't deterministic: which examples come out as all NaNs changes from run to run.
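To be concrete about what I mean by "all NaNs", this is roughly the kind of check I'm describing (a simplified sketch; model and loader are placeholders for my actual objects):

import torch

# assumes `model` and `loader` are already defined
device = torch.device("mps")
model = model.to(device).eval()

with torch.inference_mode():
    for i, batch_data in enumerate(loader):
        outputs = model(batch_data["input_tensor"].to(device))
        if torch.isnan(outputs).all():
            print(f"batch {i}: every output value is NaN")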

I’ll add that I’m running the following before everything (although I understand that most or all of these settings only apply to CUDA, which is what the code has always been run on):

import os

import torch

os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # must be set before any CUDA work starts
torch.use_deterministic_algorithms(True)
# When running on the cuDNN backend, two further options must be set
torch.backends.cudnn.benchmark = False  # type: ignore
torch.backends.cudnn.deterministic = True  # type: ignore

What is your code for inference?

I’m assuming you’ve already set the model to inference mode / eval mode?

Yes, the model is always set with .eval() before inference (and .train() before training).
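
In broad strokes, the surrounding loop is the standard pattern, something like this sketch (placeholder names, not my exact code):

import torch

model.eval()  # disable dropout and use running batch-norm statistics

# inference_mode() skips autograd bookkeeping entirely
with torch.inference_mode():
    for batch_data in loader:
        inputs = batch_data["input_tensor"].to(device)
        outputs = model(inputs)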

What is your code for inference?

I’m curious how much code you’re looking for here. I’ll add that I’ve never gotten NaNs on CUDA (and I’ve been running this same code, more or less, for several years), and when I recently tested on CPU, I also did not get any NaNs.

Okay, I think I’ve traced where the issue first crops up when using MPS (and only MPS, not CUDA or CPU). The problem is that something is corrupting the input tensors before they even get to the forward pass.

I’ve verified that the issue occurs with .to(device, non_blocking=True), and only when non_blocking is True. Strangely, if I access the tensor before the .to() call (just storing a reference to it, never using it), the corruption doesn’t happen. I can certainly add a conditional to check for ‘cuda’ (and, I guess, do the same for pin_memory on the DataLoader), but the other discussions I can find online suggest this shouldn’t be an issue for MPS (or was at one point, but not with PyTorch 2.4.1).

# this is a very simplified snippet of the code

for _, batch_data in enumerate(loader):
    # if I uncomment this next line, then no issue
    # dummy_variable = batch_data["input_tensor"]

    # if non_blocking is False, then no issue
    batch_data["input_tensor"] = batch_data["input_tensor"].to(device, non_blocking=True)

I’ve since added torch.mps.synchronize() for the case where MPS is being used (which I hadn’t thought about, even though I already had torch.cuda.synchronize() there). I still find it strange that accessing the variable before the copy/move to MPS was also a fix.
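
For reference, the synchronization is now conditioned on the device type along these lines (a sketch; the helper function is just how I’d wrap it, not code from the project):

import torch

def synchronize(device: torch.device) -> None:
    """Block until all queued work on the given device has finished."""
    if device.type == "cuda":
        torch.cuda.synchronize()
    elif device.type == "mps":
        torch.mps.synchronize()
    # CPU ops are synchronous, so there is nothing to wait for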