I am training a transformer and I see that the training is only using 1 GB of the 8 GB of GPU memory, and training is very slow. I want to know if there’s a way to parallelize the training on the same GPU. I read posts like these, but they seem to talk about multiple GPUs. I also thought of increasing the batch size, but that might lead to an inefficient model.
So is there a way I can increase the speed by parallelizing?
I’m not sure I understand the concern about creating an “inefficient model”. Could you explain why this wouldn’t work or what your concern is?
The used GPU memory is independent of the hardware compute resources used to parallelize the computation (just as you don’t need to fill your host RAM to be able to use all CPU cores).
Increasing the batch size would also increase the compute intensity and could speed up your workload. However, your current script, even though it uses only 1 GB of memory, could already be saturating all compute resources, in which case you wouldn’t be able to run any other workload in parallel. To check this, you could either run another script on the same device and compare the end-to-end iteration times, or use CUDA streams and check whether you see overlapping kernels.
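As a minimal sketch of the streams check (assuming PyTorch; the function name, matrix size, and iteration count here are illustrative, not from the original post), you could launch independent work on two `torch.cuda.Stream` objects and then look at a profiler/Nsight timeline for overlapping kernels:

```python
import torch

def check_stream_overlap(size=4096, iters=10):
    # Requires a CUDA device; skip gracefully otherwise.
    if not torch.cuda.is_available():
        print("CUDA not available; skipping overlap check")
        return None

    s1 = torch.cuda.Stream()
    s2 = torch.cuda.Stream()
    a = torch.randn(size, size, device="cuda")
    b = torch.randn(size, size, device="cuda")
    torch.cuda.synchronize()

    # Enqueue matmuls on two independent streams. If the first stream's
    # kernels already saturate the GPU, the second stream's kernels will
    # be serialized after them in the timeline instead of overlapping.
    with torch.cuda.stream(s1):
        for _ in range(iters):
            a @ a
    with torch.cuda.stream(s2):
        for _ in range(iters):
            b @ b
    torch.cuda.synchronize()
    return True

check_stream_overlap()
```

You could run this under `torch.profiler` or Nsight Systems and inspect the kernel timeline: overlapping kernels from the two streams would indicate spare compute capacity, while strictly serialized kernels would suggest the device is already saturated by a single stream.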