How to get PyTorch to use Ampere GPU (GPU util < 15%)?

Hi, I’m trying to train a model that uses mlagents. After much debugging on this site and Stack Overflow, I was able to get everything set up and working. The problem is that as soon as I started training, my GPU wasn’t being utilized much at all. Utilization fluctuates, but it stays around 10% and dips to 0% at times when there’s no activity. I’m not sure if that’s normal, as this is my first time training a model on my GPU. My CPU, on the other hand, was fully utilized.

If this isn’t supposed to be the case, then I suspect the problem is related to my CUDA version. I’m using an RTX 3090, and my env is as follows:

Ubuntu 22.04
python: 3.10.4
mlagents: 0.30.0
torch: 1.11.0
gym: 0.26.0
torchaudio: 0.11.0
torchvision: 0.12.0
cudatoolkit: 11.5.2

I was able to get this mlagents example working only because of this Stack Overflow answer.

The reason I think this is a CUDA-related error is that the 3090 apparently doesn’t support CUDA 11.5. However, a mod here said in a different post that this isn’t a problem if we install torch with the binaries (which is what I thought I had already done per that Stack Overflow post). Now I realize it doesn’t seem like any binaries were installed with torch by that command.

In addition, nvcc isn’t installed on my machine because the CUDA 11.5 toolkit requires Ubuntu 20.04 while I’m on 22.04, and I’d rather not downgrade if I can avoid it.

Here’s a picture of nvidia-smi while training:

It does look like the training process (soccertwo/bin/python.310) is running on the GPU, but its utilization isn’t that high.

Running torch.version.cuda in the same env returns '11.5'.

Update: I ran into this old post where ptrblck mentions the PyTorch binaries don’t come with CUDA 11.5, and this might be my problem. Does this mean I need to downgrade to a lower CUDA version so that the PyTorch binaries will work with it?

That’s not the case, as every CUDA 11 release supports your 3090. Where did you get this wrong information from?

The PyTorch binaries ship with their own CUDA dependencies; your locally installed CUDA toolkit will only be used if you build PyTorch from source or a custom CUDA extension.
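As a quick sanity check (a minimal sketch, nothing here is specific to your setup), you can print the CUDA version the binaries ship with and confirm the GPU is visible, independently of whether nvcc is installed locally:

```python
import torch

# CUDA runtime version the PyTorch binaries were built with
# (independent of any locally installed CUDA toolkit / nvcc)
print(torch.version.cuda)

# True if PyTorch can see a usable GPU
print(torch.cuda.is_available())

if torch.cuda.is_available():
    # Name of the device PyTorch will use, e.g. the RTX 3090
    print(torch.cuda.get_device_name(0))
```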

No; your locally installed CUDA toolkit won’t be used, and PyTorch will use the CUDA version you selected during installation.

To check why your GPU shows a low utilization, you could profile your code with e.g. Nsight Systems to see if e.g. your CPU is too slow and creates a bottleneck. You could also enable TF32 for matmuls via torch.backends.cuda.matmul.allow_tf32 = True, and generally you should update PyTorch to the latest version.
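Enabling TF32 is a one-line change at the top of the training script; a minimal sketch (the flag names are real PyTorch attributes, the rest is illustrative):

```python
import torch

# Allow TF32 on Ampere GPUs. Set these once, at the top of the
# training script, before any CUDA work starts.
torch.backends.cuda.matmul.allow_tf32 = True  # TF32 for matmuls
torch.backends.cudnn.allow_tf32 = True        # TF32 for cuDNN convolutions

# Verify the flags took effect in this process
print(torch.backends.cuda.matmul.allow_tf32)  # True
print(torch.backends.cudnn.allow_tf32)        # True
```

The flags are process-local, so they must be set in the same Python process that runs the training.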

Thanks @ptrblck! I assumed this whole time I was doing something wrong, but it’s a relief knowing that everything up to this point is correct. I’ll take a look at the Nsight Systems documentation.

For enabling TF32 (or setting PyTorch flags in general), is it enough to just open Python, import torch, and paste the command? If so, I’m not noticing any difference in GPU utilization after resuming training.

Unfortunately, I have no choice but to use this outdated version for class :frowning:.

Yes, you can directly call the line of code in your script to enable TF32 for matrix multiplications, but it has to run in the same process as the training; setting the flag in a separate interpreter won’t affect an already running job.
Note, however, that this would potentially speed up matmuls and could even show a lower GPU utilization if the real bottleneck comes from another part of the pipeline.

I’ve written a small tutorial here on how to profile PyTorch using nsys. Once you’ve created the profile, check if gaps (whitespace) between kernels are visible and if the kernel launches are close to the actual kernel execution, which would point towards a CPU-limited workload.
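A typical invocation looks roughly like this (a sketch; the trace sources and output name are just examples, and `python train.py` stands in for your actual mlagents training command):

```shell
# Trace CUDA, NVTX, OS runtime, cuDNN, and cuBLAS calls and write the
# report to low_util_profile.nsys-rep, which you can open in the
# Nsight Systems GUI to inspect the kernel timeline.
nsys profile -t cuda,nvtx,osrt,cudnn,cublas -o low_util_profile \
    --force-overwrite true python train.py
```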