Slow model training on my local RTX 4070 compared to a Google Colab Tesla T4

Hello everyone. I’m a complete noob and just getting my feet wet with PyTorch.

I’m using:
Anaconda version: 23.7.4
CUDA version: 12.2.140
Python version: 3.11.5
PyTorch version: 2.1

I was following some tutorials, and have been using a free Google Colab account until now. I recently bought an RTX 4070 and was very excited to stretch its legs. However, I was disappointed to see that it’s actually slower than a free Google Colab instance with a Tesla T4.

The model I used for testing is taken from one of the tutorials. Here it is:
Google Drive Link to Model

It takes around 3 minutes to hit 100000 epochs on the T4, and around 3.5 minutes on my GPU.

Even though the task manager shows the GPU load at 92%, there are some weird things going on.

When I train the model locally on my GPU, CPU0 is loaded at 100%, while the other cores are barely stressed.

Although only a single core is being loaded, torch.get_num_threads() returns 8. I tried torch.set_num_threads(os.cpu_count()) (which is 16), but it did not change anything. Setting it to 1 gave the same result.
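Roughly the snippet I used to check and change the thread count (a minimal sketch; the actual training code is in the linked notebook):

```python
import os
import torch

# Check how many intra-op threads PyTorch reports
print(torch.get_num_threads())         # prints 8 on my machine

# Force it to all logical cores (16 on the Ryzen 5700) -- no effect on training time
torch.set_num_threads(os.cpu_count())

# Forcing a single thread gives the same result as well
# torch.set_num_threads(1)
```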

GPU temperature, GPU clock and memory clock are all extremely low for 90+% load, as you can see in the picture above.

It’s hovering around 500-700 MHz, when it is capable of 3 GHz.

Theoretically the RTX 4070 should be much faster, right? More CUDA cores, higher frequencies, etc. Yes, the T4 has more VRAM, but I’m hardly using 1 GB of VRAM anyway. I think I’m CPU-limited, but even if that’s the case, I doubt that the single core the free Google Colab account gives me has better single-core performance than my overclocked Ryzen 5700. On the other hand, the GPU frequency behavior is super weird too.

What might be my problem? Thank you for your help.

Yes, you might be CPU-limited or the bottleneck might come from another part of the code, e.g. data loading. Create a full visual profile, e.g. via Nsight Systems, and compare both profiles to see where the bottleneck is. Alternatively, you could also narrow down the issue further by profiling separate parts of the code in isolation.
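For the second option, torch.profiler is a convenient way to profile a handful of training iterations directly from Python. A minimal sketch, where the model, data, and optimizer are placeholders for the ones in your notebook:

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

device = "cuda"

# Stand-in for the tutorial model/data -- swap in the ones from the notebook.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 128), torch.nn.ReLU(), torch.nn.Linear(128, 1)
).to(device)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
data = torch.randn(256, 64, device=device)
target = torch.randn(256, 1, device=device)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(100):  # profile a small number of iterations, not all 100000 epochs
        with record_function("train_step"):
            optimizer.zero_grad()
            loss = criterion(model(data), target)
            loss.backward()
            optimizer.step()

# If almost all time shows up on the CPU side and the CUDA column is tiny,
# the GPU is starved by Python/kernel-launch overhead rather than compute-bound.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```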

I have additional information.

nvidia-smi shows 0% utilization :confused: How the task manager measures the % load is a mystery.

Just as a sanity check, I set the device to CPU and ran the 100000 epochs a couple of times. Lo and behold, the CPU is much faster :smiley: ; it solved the same model in 1 minute 40-45 seconds consistently on every run. All cores were also being utilized.
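For reference, this is roughly the shape of the timing comparison I’m running (a simplified sketch; the real model comes from the linked notebook, and the torch.cuda.synchronize() call is there because CUDA kernels launch asynchronously):

```python
import time
import torch

def time_training(device, epochs=100000):
    # Placeholder model/data standing in for the tutorial notebook's.
    model = torch.nn.Sequential(
        torch.nn.Linear(2, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
    ).to(device)
    criterion = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    data = torch.randn(128, 2, device=device)
    target = torch.randn(128, 1, device=device)

    start = time.perf_counter()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()
    if device == "cuda":
        torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
    return time.perf_counter() - start

print("cpu :", time_training("cpu"))
print("cuda:", time_training("cuda"))
```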

So I guess there are models that are better trained on CPUs for some reason?

I have no idea how to use Nsight; I just poked around in its UI. I gave it the VS Code executable with the command line argument -r and pointed the working directory to where the .ipynb is. I launched it, ran the notebook, and it said no kernel was profiled. I guess it doesn’t work like that. I think it works through the VS Code add-on, and I should run the code in some sort of debugging mode with the help of the extension. I need to read some documentation; this tool is alien to me as of now.
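From what I can tell so far (an assumption on my part, I haven’t tested it yet), Nsight Systems wants a plain .py script launched under it, e.g. nsys profile -o report python train.py, rather than the notebook; NVTX ranges can then mark the training steps so they show up as labelled regions in the timeline:

```python
import torch

device = "cuda"

# Placeholder model/data standing in for the tutorial notebook's.
model = torch.nn.Linear(2, 1).to(device)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
data = torch.randn(128, 2, device=device)
target = torch.randn(128, 1, device=device)

for epoch in range(1000):
    torch.cuda.nvtx.range_push(f"epoch {epoch}")  # labelled range in the Nsight Systems timeline
    optimizer.zero_grad()
    loss = criterion(model(data), target)
    loss.backward()
    optimizer.step()
    torch.cuda.nvtx.range_pop()

torch.cuda.synchronize()
```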