Low GPU Usage during Training

Hi! I am training a ConvNet to classify CIFAR10 images on an RTX 3080 GPU. For some reason, when I look at the GPU usage in Task Manager, it shows 3% GPU usage, as shown in the image.

The model is as follows:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class ConvNet(nn.Module):
    
    def __init__(self):
        super(ConvNet,self).__init__()
        
        self.conv1 = nn.Conv2d(in_channels=3,out_channels=8,stride=1,kernel_size=(3,3),padding=1)
        self.conv2 = nn.Conv2d(in_channels=8,out_channels=32,kernel_size=(3,3),padding=1,stride=1)
        self.conv3 = nn.Conv2d(in_channels=32,out_channels=64,kernel_size=(3,3),padding=1,stride=1)
        self.conv4 = nn.Conv2d(in_channels=64,out_channels=128,kernel_size=(3,3),padding=1,stride=1)
        self.conv5 = nn.Conv2d(in_channels=128,out_channels=256,kernel_size=(3,3),stride=1)

        # after two 2x2 max-pools and the unpadded conv5, a 32x32 input becomes a 6x6x256 feature map
        self.fc1 = nn.Linear(in_features=6*6*256,out_features=256)
        self.fc2 = nn.Linear(in_features=256,out_features=128)
        self.fc3 = nn.Linear(in_features=128,out_features=64)
        self.fc4 = nn.Linear(in_features=64,out_features=10)
        
        self.max_pool = nn.MaxPool2d(kernel_size=(2,2),stride=2)
        self.dropout = nn.Dropout2d(p=0.5)
        
    def forward(self,x,targets=None):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = self.max_pool(x)
        x = self.conv3(x)
        x = F.relu(x)
        x = self.dropout(x)
        x = self.conv4(x)
        x = F.relu(x)
        x = self.max_pool(x)
        x = self.conv5(x)
        x = F.relu(x)
        x = self.dropout(x)
        x = x.view(-1,6*6*256)   # flatten the 6x6x256 feature map for the fully connected layers
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        x = F.relu(x)
        x = self.fc3(x)
        x = F.relu(x)
        logits = self.fc4(x)
        
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits,targets)
        return logits,loss
    
    def configure_optimizers(self,config):
        optimizer = optim.Adam(self.parameters(),lr=config.lr,betas=config.betas,weight_decay=config.weight_decay)
        return optimizer
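
For reference, here is a quick shape sanity check that can be run on this model (assuming the imports above; the batch below is just random data, not my real loader). It confirms that a 32x32 input ends up as a 6x6x256 feature map before the fully connected layers:

model = ConvNet()
dummy_images = torch.randn(2, 3, 32, 32)     # two fake CIFAR10-sized images
dummy_targets = torch.tensor([0, 1])         # fake class labels
logits, loss = model(dummy_images, dummy_targets)
print(logits.shape, loss.item())             # prints torch.Size([2, 10]) and a scalar loss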

Training Configurations are as follows:
Epochs : 300
Batch Size : 64
Weight Decay : 7.34e-4
Learning Rate : 3e-4
Optimizer : Adam
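
In code, the optimizer setup corresponds roughly to the following (the betas shown are just Adam's defaults, used here as a placeholder for config.betas):

model = ConvNet()
optimizer = optim.Adam(model.parameters(), lr=3e-4,
                       betas=(0.9, 0.999),      # assumed: Adam's default betas
                       weight_decay=7.34e-4)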

I am also running several transforms, such as Normalize, RandomRotation, and RandomHorizontalFlip.
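
Roughly, the transform pipeline looks like this (the rotation angle and normalization statistics below are placeholders rather than my exact values; the stats are the commonly used CIFAR10 channel means and stds):

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),                      # placeholder rotation angle
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),      # commonly used CIFAR10 channel means
                         (0.2470, 0.2435, 0.2616)),     # commonly used CIFAR10 channel stds
])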

I also have another bug. When I change the number of workers in the DataLoader, training doesn't begin at all. In the Jupyter notebook, it shows that the cell is being executed, but no output appears. So I am forced to run with num_workers=0; anything above 0 breaks for some reason.

Try typing

watch nvidia-smi

in your shell while training is running; it will show your model's real memory usage and GPU utilization.

I am on Windows 10. I tried using watch nvidia-smi in the notebook, and it gives a syntax error.

Update:

I managed to find a command to get GPU stats. It shows that it is using 14% of the GPU. Isn't that low? I am training a fairly big model, right?

In Jupyter Notebook, when you press New and then click Terminal, type the command there.
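
Since watch is not available on Windows, a roughly equivalent option (assuming nvidia-smi is on your PATH, which the NVIDIA driver install normally handles) is nvidia-smi's built-in loop flag:

nvidia-smi -l 1

which refreshes the readout every second in a regular terminal.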

Yeah, even with a bigger model, utilization depends on the batch size and the total computation done by the GPU. Try increasing the batch size and run watch nvidia-smi to monitor memory and utilization continuously.

I increased batch_size to 256. The GPU usage is now 10%. It was 14% when batch_size was 64.

Also, I have seen in many YouTube videos that if we use a very large batch size, the overall generalization of the model decreases and hence the validation accuracy goes down. Is that true?

Low GPU utilization problem - PyTorch Forums

As you can see in the linked thread, you need to increase num_workers, as that might be one of the causes.
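
For illustration, a minimal loader setup with workers might look like this (using torchvision's CIFAR10 and a plain ToTensor transform as stand-ins for your actual dataset and transforms):

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_dataset = datasets.CIFAR10(root="./data", train=True, download=True,
                                 transform=transforms.ToTensor())
train_loader = DataLoader(train_dataset,
                          batch_size=64,
                          shuffle=True,
                          num_workers=2,    # worker processes that load batches in parallel
                          pin_memory=True)  # speeds up host-to-GPU copies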

Yeah.

But if I increase num_workers to, say, 2, it breaks the training process for some reason. It doesn't start training at all. It only trains when num_workers=0. I don't know why this is happening.

This is the problem I am getting if I change the num_workers to anything above 0. I don’t know why it doesn’t work for me.

I have also seen some YouTube videos suggesting keeping the batch_size at something like 32 or 64. They say not to use very large batch sizes, as that reduces the generalization of the model. Is that true?

Yeah, generalization depends on the model and, in my opinion, on the number of classes in your dataset, so it is mostly dependent on the type of dataset you have in hand. Ideally, batch sizes of 32, 64, 128, or 256 work, depending on the dataset. If someone has very big images, they will also use batch sizes like 4, 8, or 16 because of memory constraints.

I am not sure which trainer you are using.

I am using a custom training loop that I found in Andrej Karpathy's minGPT repo; I thought it was a nice way of doing it. Even if I skip that trainer and use a simple training loop, the DataLoader with num_workers>0 still doesn't work.

Link : trainer.py

Can you print the loss after line 83?

print("---",loss.item())

As you mentioned, I have printed loss after the loss.mean() line in the trainer.

Increase the batch size and the number of workers and see if the loss still prints.

No, it does not print anything. It gets stuck like this, and I cannot interrupt the kernel either.

Can you change the batch size to something like 4, 8, or 16?

I am still getting the same result.

I tried running an earlier project of mine that trains on MNIST digits. There I changed num_workers to 2 and ran it in a terminal instead of in the notebook.

This is what I have got.

I used PyTorch's MNIST dataset itself and trained it with the trainer class.
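
From what I have read, the DataLoader workers on Windows are started with spawn, which re-imports the launching script, so everything that kicks off training has to sit behind a main guard when run as a script. A minimal sketch of what I mean (the dataset and loop below are placeholders, not my real trainer):

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def main():
    train_set = datasets.MNIST(root="./data", train=True, download=True,
                               transform=transforms.ToTensor())
    loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=2)
    for images, labels in loader:       # placeholder loop instead of the real trainer
        print(images.shape, labels.shape)
        break

if __name__ == "__main__":              # keeps spawned worker processes from re-running training
    main()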

Can you share your complete MNIST code here?

I have uploaded it to my GitHub.

Link : MNIST-PyTorch