When running our training code on 4x RTX Blackwell 6000 cards with nvcr.io/nvidia/pytorch:25.09-py3, training is stable. After upgrading to nvcr.io/nvidia/pytorch:25.10-py3, however, the loss quickly goes to NaN with the exact same training configuration. We upgraded torch/triton to the nightlies and are still getting NaNs.
Training is stable, though, if we halve the batch size.
Presumably the newer versions of torch/triton are selecting buggy kernels, but how do we open a ticket so someone can reproduce this? It seems hardware-dependent. Is there a way for us to upload the Inductor cache files?
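In case it helps anyone answering: a minimal sketch of how we could package the Inductor cache for attachment, assuming the default cache location (when `TORCHINDUCTOR_CACHE_DIR` is unset, Inductor writes compiled artifacts under `/tmp/torchinductor_<username>`). The tarball name `inductor_cache.tar.gz` is just a placeholder we picked:

```shell
# Hedged sketch: locate and bundle the TorchInductor cache for a bug report.
# Respect an explicit TORCHINDUCTOR_CACHE_DIR if set; otherwise fall back
# to the default per-user location under /tmp.
CACHE_DIR="${TORCHINDUCTOR_CACHE_DIR:-/tmp/torchinductor_$(whoami)}"
echo "Inductor cache dir: $CACHE_DIR"

if [ -d "$CACHE_DIR" ]; then
  # Archive the whole cache directory so maintainers can inspect the
  # generated Triton kernels and compilation metadata.
  tar czf inductor_cache.tar.gz -C "$(dirname "$CACHE_DIR")" "$(basename "$CACHE_DIR")"
  echo "Wrote inductor_cache.tar.gz"
else
  echo "No Inductor cache found at $CACHE_DIR"
fi
```

Would attaching an archive like this to a GitHub issue actually be useful for reproduction, or is there a preferred artifact (e.g. `TORCH_LOGS` output) we should capture instead?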
Thanks!