Low GPU Usage during Training

Yeah, even if you use a bigger model, it depends on the batch size and the total computation done by the GPU. Try increasing the batch size, and keep watching nvidia-smi to monitor memory usage and utilization continuously.

I increased batch_size to 256. The GPU usage is now 10%. It was 14% when batch_size was 64.

Also, I have seen in many YouTube videos that if you use a very large batch size, the overall generalization of the model decreases and hence the validation accuracy goes down. Is that true?

Low GPU utilization problem - PyTorch Forums

As you can see in the link, you may need to increase num_workers; that might be one of the causes.
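For reference, num_workers is set on the DataLoader itself. A minimal sketch with illustrative values (the MNIST dataset here is just a stand-in so the example is self-contained, not your exact setup):

import torchvision
from torchvision.transforms import ToTensor
from torch.utils.data import DataLoader

# stand-in dataset so the example runs on its own
train_set = torchvision.datasets.MNIST(root="./data", train=True, download=True, transform=ToTensor())

loader = DataLoader(
    train_set,
    batch_size=64,
    shuffle=True,
    num_workers=4,    # >0 prepares batches in background worker processes
    pin_memory=True,  # speeds up host-to-GPU transfers when training on CUDA
)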

Yeah.

But if I increase num_workers to, say, 2, it breaks the training process for some reason. It doesn't start training at all; it only trains when num_workers=0. I don't know why this is happening.

This is the problem I am getting if I change the num_workers to anything above 0. I don’t know why it doesn’t work for me.

Also, I have seen some YouTube videos suggesting keeping the batch_size around 32 or 64. They say not to use very large batch sizes because it reduces the generalization of the model. Is that true?

Yeah, the generalization of the model does depend on it, and in my opinion it also depends on the number of classes in your dataset. So it is mostly dependent on the type of dataset you have in hand. Ideally 32, 64, 128, or 256 works, depending on the dataset. People with very large images also use batch sizes like 4, 8, or 16 because of memory constraints.

I am not sure about which trainer you are using.

I am using a custom training loop that I found in Andrej Karpathy's minGPT repo. I thought it was a nice way of doing it. But even when I don't use that trainer and use a simple training loop, the DataLoader with num_workers>0 doesn't work.

Link : trainer.py

Can you print the loss after line 83?

print("---",loss.item())

As you mentioned, I have printed the loss after the loss.mean() line in the trainer.

Increase the batch size and the number of workers and see if the loss still prints.

No, it does not print anything. It gets stuck like this, and I cannot interrupt the kernel either.

Can you change the batch size to something like 4, 8, or 16?

I am still getting the same result.

I tried running an earlier project that I did for training MNIST digits. There I set num_workers=2 and ran it in a terminal instead of a notebook.

This is what I have got.

I have used PyTorch’s MNIST dataset itself and trained using the trainer class.

Can you share your complete MNIST code here?

I have uploaded it to my GitHub.

Link : MNIST-PyTorch

Unfortunately, I don't think I can help much, because the file from your GitHub runs successfully for me with different batch sizes and different numbers of workers. My GPU utilization is around 22%. But I am using a Linux machine, and it has 32 CPUs.

But I did find one interesting thread about how num_workers>0 might not work on Windows and how to fix it
(Errors when using num_workers>0 in DataLoader - PyTorch Forums). The solution is:

def train():
    # Put the whole code that trains the network here ...

if __name__ == '__main__':
    train()
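For context: on Windows, DataLoader workers are started with the spawn method, which re-imports the main script in each worker process, so any code that (directly or indirectly) creates workers must only run under the __main__ guard. A minimal self-contained sketch of that pattern, with a toy dataset in place of your real one:

import torch
from torch.utils.data import DataLoader, TensorDataset

def train():
    # toy dataset just to demonstrate the guarded pattern
    data = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))
    loader = DataLoader(data, batch_size=64, shuffle=True, num_workers=2)
    for x, y in loader:
        pass  # the real training step would go here

if __name__ == '__main__':
    train()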

So I tried modifying your files, but as I mentioned I don't have a Windows system, so try the code below in your MNIST.py file. If it doesn't work, please follow the link above.

import numpy as np
import torch
import torch.nn as nn
import torchvision

from torchvision.transforms import ToTensor
from torch.utils.data import Dataset
from torchvision.utils import make_grid
from torch.utils.data import random_split

import matplotlib.pyplot as plt

from model import MnistMLP,MnistCNN
from trainer import TrainerConfig,Trainer
from visualize import Plot

train_set = torchvision.datasets.MNIST(
    root="./data",
    train=True,
    download=True,
    transform=ToTensor()
)

test_set = torchvision.datasets.MNIST(
    root="./data",
    train=False,
    download=True,
    transform=ToTensor()
)

if __name__ == '__main__':
    # On Windows, anything that (indirectly) starts DataLoader workers must run under this guard
    cnn_train_configs = TrainerConfig(ckpt_path="./CNNModel.pt",max_epochs=40,learning_rate=4.67e-4,weight_decay=6.423e-4)
    CNN_model = MnistCNN()
    trainer = Trainer(model=CNN_model,train_dataset=train_set,test_dataset=test_set,config=cnn_train_configs)
    model_metrics = trainer.train()

    plotter = Plot(model_metrics=model_metrics)
    plotter.plot()

It works!

My GPU is now being used around 20% during training.

Thank you so much for helping me out.


I had a similar problem with my training. It ended up being that the GPUs were waiting for IO and the CPU to finish their work. I could see in my GPU metrics that the GPUs were at 100% sometimes, but most of the time they were waiting. I had to change my approach to how I loaded the data.
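In case it helps anyone hitting the same wall: a minimal sketch of the DataLoader settings worth experimenting with, using a toy random dataset and illustrative values, and assuming a CUDA GPU is available (not necessarily the exact change I made):

import torch
from torch.utils.data import DataLoader, TensorDataset

# toy in-memory dataset just so the sketch runs end to end
dataset = TensorDataset(torch.randn(2000, 3, 32, 32), torch.randint(0, 10, (2000,)))

loader = DataLoader(
    dataset,
    batch_size=128,
    shuffle=True,
    num_workers=4,            # prepare batches in parallel on the CPU
    pin_memory=True,          # page-locked host memory for faster GPU copies
    persistent_workers=True,  # keep workers alive between epochs (requires num_workers > 0)
    prefetch_factor=2,        # batches each worker loads ahead (requires num_workers > 0)
)

if __name__ == '__main__':
    for x, y in loader:
        # overlap the host-to-GPU copy with compute (pairs with pin_memory=True)
        x = x.to("cuda", non_blocking=True)
        y = y.to("cuda", non_blocking=True)
        # forward/backward pass would go here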


I think I ran into the same problem. May I know how you solved it?
Thank you!