What is the current best practice for running PyTorch (with CUDA 12) on the NVIDIA GH200 (Grace Hopper) nodes, with workable package management?
The ideal solution would be to have the PyTorch installation within a `conda` environment, but I see that this is not yet available, as mentioned here.
The alternative I’ve been using is the NGC container, but package installations aren’t persistent when working with the Singularity `.sif` format. I can decompose the `.sif` into a writable sandbox, i.e. `singularity build --sandbox pytorch pytorch.sif`, but I would prefer not to continually rebuild the `.sif` after each package installation, and maintaining the sandbox makes a dent in the file-count limit of my cluster directory.
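For concreteness, the sandbox round-trip I'm describing looks roughly like this (the installed package name is only illustrative):

```shell
# Unpack the NGC image into a writable sandbox directory.
singularity build --sandbox pytorch pytorch.sif

# Install extra packages inside the writable sandbox
# (package name here is just an example).
singularity exec --writable pytorch pip install timm

# Repack the sandbox into a new .sif after each change --
# this is the step I'd like to avoid repeating.
singularity build pytorch-new.sif pytorch
```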
A third option would be to `pip install` the wheels found here locally, but I would also like to have, e.g., `torchvision`.
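A rough sketch of that third option (the wheel filename is illustrative, and since a matching aarch64 `torchvision` wheel may not be available, building it from source against the installed `torch` is one possible workaround):

```shell
# Install a locally downloaded aarch64 PyTorch wheel
# (filename below is illustrative, not an actual release).
pip install ./torch-2.1.0-cp310-cp310-linux_aarch64.whl

# Force torchvision to build from source rather than use a
# (possibly unavailable) prebuilt wheel for this platform.
pip install --no-binary torchvision torchvision
```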
I also tried building PyTorch from source in a fresh `conda` environment, but quickly ran into issues with the compilers in the NVIDIA HPC SDK.