Training not stopping until interrupted on RTX 3060 Ti

I recently bought an RTX 3060 Ti GPU; before that I worked on the free version of Google Colab (Tesla T4 GPU). I am working on a computer vision project with YOLOv5 on Colab, and now I want to shift over to my local PC. I downloaded YOLOv5 on my local machine, made an environment for it, and installed the required dependency libraries.
How I installed PyTorch and CUDA:

  1. Made an environment using VS Community
  2. Installed pip
  3. Then used the command from the PyTorch site to install the CUDA 11.7 build:

    pip3 install torch torchvision torchaudio --extra-index-url
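As a quick check that step 3 actually installed a CUDA-enabled build (the CUDA wheels from the PyTorch index bundle their own CUDA and cuDNN runtimes, which is why a separate cuDNN download isn't required), something like this can be run in the same environment; a minimal sketch:

```python
import torch

# A CUDA build reports a version tag like "1.13.1+cu117"; a CPU-only
# build reports "+cpu" and returns None for the bundled runtimes below.
print(torch.__version__)
print(torch.version.cuda)               # bundled CUDA runtime, or None
print(torch.backends.cudnn.version())   # bundled cuDNN version, or None
print(torch.cuda.is_available())        # True if driver + GPU are visible
```

If `torch.cuda.is_available()` prints `False`, the wheel is CPU-only or the NVIDIA driver is too old, and training will silently fall back to the CPU.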

Afterward, I verified that PyTorch can see the GPU using the commands below:

import torch
print(torch.cuda.get_device_name(0))

and it displayed the GPU there. Then, to test my GPU, I ran training for 3 epochs with the same dataset I used on Colab. On Colab that run took only around 30 seconds to complete (on a Tesla T4, which has around 2,000 fewer CUDA cores than the RTX 3060 Ti); on my local GPU, training kept running for around 3 hours without stopping, so I interrupted it.

NOTE: I didn't download or use cuDNN (I followed a YouTube tutorial and the author didn't install it either). Secondly, when I start training, my C: drive fills up (I don't know why), so I uninstalled Python from C: and reinstalled it on D:, but that didn't help either. Thirdly, my GPU utilization averages 15 to 20% while training.
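One possible reason the C: drive fills up is pip's download cache, which lives on the system drive by default and holds the large torch wheels. A sketch of relocating it via pip's `PIP_CACHE_DIR` environment variable (the target paths here are just example locations):

```shell
# Relocate pip's download cache off the system drive so large wheel
# downloads (the torch CUDA wheel alone is over 1 GB) stop filling C:.
# Windows (current session):  set PIP_CACHE_DIR=D:\pip-cache
# Windows (persistent):       setx PIP_CACHE_DIR D:\pip-cache
# POSIX equivalent shown below:
export PIP_CACHE_DIR="$HOME/pip-cache"
```

This only moves pip's cache; if the dataset or YOLOv5 run artifacts are also written to C:, those would need to be moved separately.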

Please help me with this; I would be really thankful, as I have my university FYP due and this issue is keeping me from working smoothly.

Check which CUDA runtime you've installed and make sure it's 11.x, as older ones will not work with your Ampere GPU.
Once done, add print debug statements to your code to check which lines are executed and where it gets stuck, or whether the script is running as expected but is just slow in your local setup.
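As a complement to print-debugging, a standalone micro-benchmark can tell "stuck" apart from "slow": if a few large matmuls finish in well under a second on the GPU, the card itself is fine and the bottleneck is elsewhere in the training setup. A minimal sketch (sizes and iteration count are arbitrary):

```python
import time
import torch

# Falls back to CPU if CUDA is unavailable, so the timing is comparable.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device)

start = time.perf_counter()
for _ in range(10):
    y = x @ x
if device == "cuda":
    torch.cuda.synchronize()  # CUDA kernels launch asynchronously; wait before timing
elapsed = time.perf_counter() - start
print(f"{device}: 10 matmuls of 1024x1024 took {elapsed:.3f}s")
```

On a healthy RTX 3060 Ti this should be near-instant; if even this hangs, the problem is the CUDA setup rather than the YOLOv5 code.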

This is the CUDA version my system has. I did, however, install the CUDA 11.7 build via the pip command from the PyTorch site, inside the environment I created for PyTorch.

I used only 5 images in training for 3 epochs as a test. The epoch info is not showing, but this time it displayed the output.

I have somehow made it work, but can you please tell me whether it's taking too much time while training? GPU utilization is 3% and CPU is at 18%, even though PyTorch shows the GPU is enabled. (I am running it in a terminal, not Jupyter.)

A GPU utilization of 3% is quite low, and based on your screenshot you are also using only 0.75 GB of GPU memory for your training. This could indicate a small batch size, so you might want to increase it.
Also, profiling the code (e.g. via Nsight Systems) would help narrow down the bottlenecks in your code.
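If installing Nsight Systems is not an option, the built-in `torch.profiler` can give a first impression of where the time goes; long CPU-side entries with little CUDA time usually point at the data pipeline rather than the GPU. A hedged sketch (the model and input here are stand-ins, not the YOLOv5 model):

```python
import torch
from torch.profiler import ProfilerActivity, profile

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512).to(device)
inp = torch.randn(32, 512, device=device)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

# Profile a few forward passes and print the most expensive operators.
with profile(activities=activities) as prof:
    for _ in range(5):
        model(inp)

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))
```

Wrapping a few real training iterations in the same `with profile(...)` block would show whether dataloading or the GPU kernels dominate.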

Oh OK, but when I use a batch size of 6, I get a memory error. I am new to this and don't know what to do about that; that's why I set the batch size to 5.
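If the memory error comes from the GPU, a common workaround is gradient accumulation: keep a small per-step batch and only call `optimizer.step()` every few batches, so the effective batch size grows without the extra memory. (If I remember correctly, YOLOv5's `train.py` already accumulates gradients internally toward a nominal batch size, so for YOLOv5 a small `--batch-size` is usually fine as-is.) The generic PyTorch pattern, as a sketch with placeholder model and sizes:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(16, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

accum_steps = 4   # effective batch = micro_batch * accum_steps
micro_batch = 2   # small enough to fit in GPU memory

optimizer.zero_grad()
for step in range(8):
    x = torch.randn(micro_batch, 16, device=device)
    y = torch.randn(micro_batch, 1, device=device)
    # Scale the loss so accumulated gradients average over the effective batch.
    loss = loss_fn(model(x), y) / accum_steps
    loss.backward()               # gradients accumulate in each param's .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()          # one update per accum_steps micro-batches
        optimizer.zero_grad()
```

If the error is a host RAM (not CUDA) memory error, reducing the number of DataLoader workers is worth trying instead.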