Because of the security fix, I removed the last version of PyTorch 2.0 that I had installed, from December 26th:
pip3 uninstall -y torch torchvision torchaudio torchtriton
I then reinstalled PyTorch 2.0:
pip3 install numpy --pre torch --force-reinstall --index-url https://download.pytorch.org/whl/nightly/cu117
I have CUDA 11.7, 4 A6000 GPUs, Driver 525.60.13.
The training doesn’t start and the kernel dies immediately after filling the GPU memory.
If I revert to the PyTorch version from December 26th, everything works fine.
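
To compare the working and failing installs, it may help to record exactly which build and GPUs each one sees. The following is only a minimal sketch: the helper name `describe_environment` is hypothetical, and it reads nothing beyond standard `torch` attributes. Running it once under each install and diffing the output should show whether the version string or CUDA build differs.

```python
import importlib.util


def describe_environment():
    """Return a dict describing the installed torch build and visible GPUs.

    `describe_environment` is a hypothetical helper, not part of PyTorch.
    """
    info = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not info["torch_installed"]:
        return info

    try:
        import torch
    except Exception as exc:  # a broken install can fail at import time
        info["import_error"] = repr(exc)
        return info

    info["torch_version"] = torch.__version__        # nightly builds include a date tag
    info["cuda_build"] = torch.version.cuda          # CUDA version torch was built against
    info["cuda_available"] = torch.cuda.is_available()
    info["gpu_count"] = torch.cuda.device_count()
    info["gpu_names"] = [
        torch.cuda.get_device_name(i) for i in range(info["gpu_count"])
    ]
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

PyTorch also ships a fuller report via `python3 -m torch.utils.collect_env`, which is usually what gets requested in bug reports.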