Hi,
What is the lowest version number of PyTorch that started supporting CUDA 11.8? Do I need to go all the way up to PyTorch 1.13 to get support?
And to be clear, I’m always compiling from source.
Cheers,
David
I don’t see any CUDA 11.8-specific changes needed to enable source builds (besides enabling the Hopper architecture, if needed), and we were using a 1.13.0 prerelease in our first 11.8 container, as seen here.
thanks @ptrblck for the link
It seems that upgrading to CUDA 11.8 breaks DDP (at least on the Ada architecture, e.g. the RTX 4090 series).
Are you aware of this issue?
Do you know of a workaround other than training on a single GPU?
Best
I doubt CUDA 11.8 breaks DDP workloads, as I haven’t seen any failures with it so far.
Check the few similar issues posted on this discussion board, and post the missing information about your workflow and environment, as well as the log outputs using the debug flags.
thanks @ptrblck
After further investigation, the problem was due to the NCCL backend trying to use peer-to-peer (P2P) transport.
Forcing NCCL_P2P_DISABLE=1 fixed the issue.
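One way to apply this workaround is to set the variable in the environment before any process group is created (a minimal sketch; whether P2P actually needs to be disabled depends on your driver and hardware):

```python
import os

# Disable NCCL peer-to-peer transport. This must be set before
# torch.distributed.init_process_group() runs, since NCCL reads it
# when the communicator is created.
os.environ["NCCL_P2P_DISABLE"] = "1"

# Optional: surface NCCL's own diagnostics when debugging DDP hangs.
os.environ.setdefault("NCCL_DEBUG", "INFO")

print(os.environ["NCCL_P2P_DISABLE"])
```

Alternatively, export NCCL_P2P_DISABLE=1 in the shell before launching the training script, which has the same effect for all spawned ranks.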
This seems to be a known issue. I’m unsure whether NVIDIA or AMD is addressing it.
How can I install torch 1.13.1 (or anything lower than torch 2.0) with CUDA 11.8, without installing from source, using one of the commands here? There is only one example for CUDA 11.8, but it is given for torch 2.0.
When I run
conda install pytorch==1.13.1 pytorch-cuda=11.8 -c pytorch -c nvidia
Torch ends up being installed without CUDA support, since torch.version.cuda is empty and torch.zeros(1).cuda() gives
<stdin>:1: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /opt/conda/conda-bld/pytorch_1670525493953/work/torch/csrc/utils/tensor_numpy.cpp:77.)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../python3.10/site-packages/torch/cuda/__init__.py", line 221, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
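A quick way to distinguish a CPU-only wheel from a CUDA build without triggering that traceback is to inspect torch.version.cuda directly (a small helper sketch; the function name is mine, not part of torch):

```python
def cuda_build_info():
    """Report whether the installed torch wheel was built with CUDA support."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    # torch.version.cuda is None (or empty) in CPU-only builds and a
    # version string like "11.8" in CUDA-enabled builds.
    if not torch.version.cuda:
        return "CPU-only build"
    return f"built with CUDA {torch.version.cuda}"

print(cuda_build_info())
```

A CPU-only install, as in the error above, reports "CPU-only build" here without raising an AssertionError.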
The PyTorch 1.13.1 binaries were built with CUDA 11.6 and 11.7, as given here.
Note that the binaries ship with their own CUDA dependencies and your locally installed CUDA toolkit will be used if you build PyTorch from source or a custom CUDA extension.
In case you need to run torch==1.13.1 with CUDA 11.8, you would thus have to build it from source.
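The source build roughly follows the steps from the PyTorch README (a sketch, assuming CUDA 11.8 and a matching cuDNN are already installed locally; adjust CUDA_HOME and the architecture list to your system):

```shell
# Check out the release tag with its submodules.
git clone --recursive --branch v1.13.1 https://github.com/pytorch/pytorch
cd pytorch
pip install -r requirements.txt

# Point the build at the local CUDA 11.8 toolkit.
export CUDA_HOME=/usr/local/cuda-11.8

# Restrict the build to your GPU's compute capability
# (8.9 for the RTX 4090) to keep compile times down.
export TORCH_CUDA_ARCH_LIST="8.9"

python setup.py install
```

Afterwards, torch.version.cuda should report 11.8 instead of being empty.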