Hi, I’m trying to train a model that uses ML-Agents. After much debugging here and on Stack Overflow, I was able to get everything set up and working. The problem is that as soon as I started training, my GPU wasn’t being utilized much at all. Utilization fluctuates but hovers around 10%, and it dips to 0% at times when there’s no activity. I’m not sure whether that’s normal, since this is my first time training a model on my GPU. My CPU, on the other hand, was fully utilized.
If this isn’t how it’s supposed to be, then I suspect the problem is related to my CUDA version. I’m using an RTX 3090, and my environment is as follows:
Ubuntu 22.04
python: 3.10.4
mlagents: 0.30.0
torch: 1.11.0
gym: 0.26.0
torchaudio: 0.11.0
torchvision: 0.12.0
cudatoolkit: 11.5.2
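For reference, this is the kind of quick sanity check I would run in that env to confirm torch can actually see the card (plain PyTorch calls, nothing ML-Agents specific; the matmul at the end is just my own throwaway test):

    import torch

    print(torch.__version__)             # 1.11.0
    print(torch.version.cuda)            # CUDA version the binary was built against
    print(torch.cuda.is_available())     # should be True if the GPU is usable
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))   # should report the RTX 3090
        # tiny matmul on the GPU just to confirm kernels actually run
        x = torch.randn(1024, 1024, device="cuda")
        print((x @ x).sum().item())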
I was only able to get this ML-Agents example working because of this Stack Overflow answer.
The reason I think this is a CUDA-related error is that the 3090 apparently doesn’t support CUDA 11.5. However, a mod here said in a different post that this isn’t a problem if we install torch with the prebuilt binaries (which is what I thought I had already done from that Stack Overflow post). Now I realize it doesn’t look like any CUDA binaries were installed with torch by that command.
In addition, nvcc isn’t installed on my machine, because the CUDA 11.5 toolkit requires Ubuntu 20.04 and I’m on 22.04, and I’d rather not downgrade if I can avoid it.
Here’s a picture of nvidia-smi while training:
It does look like the training process (soccertwo/bin/python3.10) is running on the GPU, but its utilization isn’t that high.
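To rule out the card or the install being the bottleneck, I was thinking of running a throwaway benchmark like the sketch below (my own test, not from the ML-Agents docs) and watching nvidia-smi while it runs; if GPU-Util gets pinned near 100% here but stays around 10% during training, I’d guess the limit is the training workload itself rather than the setup:

    import time
    import torch

    assert torch.cuda.is_available()
    x = torch.randn(8192, 8192, device="cuda")
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(20):
        x = x @ x            # large matmuls should saturate the GPU
        x = x / x.norm()     # renormalize so values don't overflow
    torch.cuda.synchronize()
    print(f"20 matmuls took {time.time() - t0:.2f}s")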
Running torch.version.cuda in the same env returns ‘11.5’.
Update: I ran into this old post where ptrblck mentions that the PyTorch binaries don’t come with CUDA 11.5, and this might be my problem. Does this mean I need to downgrade to a lower CUDA version so that the PyTorch binaries will work with it?
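Concretely, if I’m reading the previous-versions page on pytorch.org correctly, that would mean reinstalling this stack from the cu113 wheels, something like (command paraphrased from that page, so please correct me if it’s wrong):

    pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113

after which torch.version.cuda should report ‘11.3’ instead of ‘11.5’. Is that the right way to go about it?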