Autocast not consistent across different GPUs (A100 and RTX A6000)

issue #108627

:bug: Describe the bug

I train and run inference on a classifier using autocast. The results differ across GPUs (same .venv, code, and data).
The result on the A100 is much better than on the RTX A6000.
Disabling autocast on the RTX A6000 by using ctx = nullcontext() gives a result similar to the A100 with autocast.
I get no torch warnings on either machine.

import torch
from contextlib import nullcontext

# bfloat16 autocast context (replace with nullcontext() to disable autocast)
ctx = torch.amp.autocast(device_type='cuda', dtype=torch.bfloat16)

# training
with ctx:
    logits, loss = classifier(X, Y)

# inference
with ctx:
    logits, loss = classifier(X, None)

P.S. This has cost me a month of my business time.

Is there a reason you are cross-posting the same issue directly?

Just to increase the chances of getting a solution. Is it discouraged?

Yes, since multiple users could end up debugging the same issue, thus wasting time.

OK, now I know, thank you. Please feel free to delete the topic.