What is the current best practice for running PyTorch (with CUDA 12) on the NVIDIA GH200 (Grace Hopper) nodes, with workable package management?
The ideal solution would be to have the PyTorch installation within a `conda` environment, but I see that this is not yet available, as mentioned here.
The alternative I’ve been using is the NGC container, but package installations aren’t persistent when working with the Singularity `.sif` format. I can decompose the `.sif` into a writable sandbox, i.e. `singularity build --sandbox pytorch pytorch.sif`, but I would prefer not to continually rebuild the `.sif` after each package installation, and maintaining the sandbox makes a dent in the file-count limit of my cluster directory.
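For concreteness, the sandbox round-trip I'm describing looks roughly like this (the installed package name is only illustrative):

```shell
# Unpack the NGC image into a writable sandbox directory.
singularity build --sandbox pytorch pytorch.sif

# Install extra packages inside the writable sandbox
# (package name here is just an example).
singularity exec --writable pytorch pip install timm

# Repack the sandbox into a new .sif after each change --
# this is the step I'd like to avoid repeating.
singularity build pytorch-new.sif pytorch
```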
A third option would be to `pip install` the wheels found here locally, but I would also like to have, e.g., `torchvision`.
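A rough sketch of that third option (the wheel filename is illustrative, and since a matching aarch64 `torchvision` wheel may not be available, building it from source against the installed `torch` is one possible workaround):

```shell
# Install a locally downloaded aarch64 PyTorch wheel
# (filename below is illustrative, not an actual release).
pip install ./torch-2.1.0-cp310-cp310-linux_aarch64.whl

# Force torchvision to build from source rather than use a
# (possibly unavailable) prebuilt wheel for this platform.
pip install --no-binary torchvision torchvision
```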
I also tried building PyTorch from source in a fresh `conda` environment, but quickly ran into issues with the compilers in the NVIDIA HPC SDK.