Fully utilize GPU memory for training


I am training a transformer and I see that training uses only 1GB of the 8GB of GPU memory, and it is training very slowly. I want to know if there’s a way I can parallelize the training on the same GPU. I read posts like these, but they seem to talk about multiple GPUs. I also thought of increasing the batch size, but that might lead to an efficient model.

So is there a way I can increase the speed by parallelising?

I’m not sure I understand the concern about creating an “efficient model”. Could you explain why this wouldn’t work or what your concern is?

The amount of GPU memory used is independent of the hardware compute resources used to parallelize the computation (just as you don’t need to fill your host RAM to be able to use all CPU cores).
Increasing the batch size would also increase the compute intensity and could speed up your workload. However, your current script, even though it uses only 1GB, might already be using all hardware compute resources, in which case you won’t be able to run any other workload in parallel. To check this, you could either run another script on the same device and compare the end-to-end iteration times, or use CUDA streams and check whether you see overlapping kernels.
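A rough way to test whether a larger batch size actually helps is to time a fixed number of iterations at several batch sizes and compare samples per second. The sketch below is only an illustration under stated assumptions: `samples_per_second`, `sweep`, and `dummy_step` are hypothetical names, and `step` stands in for one full training iteration of your own script. It uses only the standard library; with PyTorch on a GPU you would pass `torch.cuda.synchronize` as `sync`, because CUDA kernels are launched asynchronously and the timer would otherwise stop before the work has finished.

```python
import time


def samples_per_second(step, batch_size, iters=20, warmup=3, sync=lambda: None):
    """Time `iters` calls of step(batch_size) and return throughput in samples/s.

    `step` is a hypothetical callable that runs one full training iteration
    (forward, backward, optimizer step) on a batch of the given size.
    `sync` should block until the device is idle; for PyTorch on a GPU that
    would be torch.cuda.synchronize (assumption: you are timing CUDA work).
    """
    for _ in range(warmup):      # warm-up iterations are not timed
        step(batch_size)
    sync()
    start = time.perf_counter()
    for _ in range(iters):
        step(batch_size)
    sync()                       # wait for queued work before stopping the clock
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed


def sweep(step, batch_sizes, **kwargs):
    """Return {batch_size: samples/s} for each candidate batch size."""
    return {bs: samples_per_second(step, bs, **kwargs) for bs in batch_sizes}


if __name__ == "__main__":
    # Stand-in workload with a fixed cost per call, purely for illustration.
    def dummy_step(batch_size):
        time.sleep(0.001)

    for bs, rate in sweep(dummy_step, [8, 16, 32]).items():
        print(f"batch_size={bs:3d}  {rate:10.1f} samples/s")
```

If samples per second keeps rising as the batch size grows, the GPU was probably not saturated at the smaller size. If it flattens out, the compute units are likely already busy, and using more of the 8GB will not make training faster.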

Sorry for the confusion. I meant an accurate model. I don’t know why I wrote efficient.

Regarding the second point, I ran some matrix multiplication on the GPU from my terminal and it gave an answer. I did this:

import torch
x = torch.rand(100).cuda()  # random vector created on the CPU, then copied to the GPU
y = torch.rand(100).cuda()

It even showed up in nvidia-smi. So did I do it correctly? Or were you referring to something else?