I have a small model that uses very little of the GPU’s resources. I am looking for the best way to run data through this model as quickly as possible. The order of the data does not matter, so no synchronization needs to occur.
I have looked into multiprocessing; however, I have been unable to pass the model and tensors between processes after calling .cuda(), and moving them to the GPU every time seems to cause a significant slowdown.
What is the best way to get as much as possible out of a single GPU?
A large batch size should work. Batching parallelizes the computation so that more of the GPU’s cores are kept busy. You could think of a batch size of 1,000 as running 1,000 copies of the model side by side. Granted, that is an imperfect comparison, because once all the cores are in use, a larger batch size won’t make the total job finish any faster.
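As a rough illustration of the batching idea, here is a minimal sketch. The tiny `nn.Sequential` model, the feature sizes, and the `run_batched` helper are all made up for the example (they are not from the question), and it falls back to the CPU if no GPU is present.

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in for the small model; moved to the device once.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
model = model.to(device).eval()


def run_batched(model, data, batch_size):
    """Run a CPU tensor `data` through `model` in chunks of `batch_size`."""
    outputs = []
    with torch.inference_mode():  # no autograd bookkeeping for inference
        for start in range(0, data.shape[0], batch_size):
            batch = data[start:start + batch_size].to(device)
            outputs.append(model(batch).cpu())
    return torch.cat(outputs)


data = torch.randn(10_000, 16)  # made-up input data
results = run_batched(model, data, batch_size=4096)
print(results.shape)
```

The batch size of 4096 is arbitrary; in practice you would raise it until throughput stops improving or you run out of GPU memory.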
Increasing the batch size is indeed the easiest way to saturate a GPU. GPU work is launched asynchronously, so you shouldn’t need to parallelize the way you submit jobs to the GPU yourself.
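One way to see where the GPU saturates is to measure throughput at a few batch sizes. The sketch below is only an assumption about how you might do that: the `nn.Linear` model, the batch sizes, and the `samples_per_second` helper are invented for illustration. Because CUDA kernels are queued asynchronously, it calls `torch.cuda.synchronize()` before reading the clock.

```python
import time

import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(64, 64).to(device).eval()  # hypothetical small model


def samples_per_second(batch_size, iters=20):
    """Time `iters` forward passes and return processed samples per second."""
    x = torch.randn(batch_size, 64, device=device)
    with torch.inference_mode():
        model(x)  # warm-up pass, not timed
        if device.type == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()  # wait for queued kernels to finish
        elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed


throughput = {bs: samples_per_second(bs) for bs in (1, 64, 1024)}
for bs, rate in throughput.items():
    print(f"batch_size={bs:5d}: {rate:,.0f} samples/s")
```

If samples per second stops growing as the batch size increases, the GPU is likely saturated for this model.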
Also, if your utilization is very low, it could be that you’re not actually using your GPU. Run nvidia-smi and make sure that GPU memory allocation changes to account for the model size, and that both the model weights and the input data are stored on the GPU.
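You can also check the device placement from inside Python. This is a small sketch with a made-up `nn.Linear` model and random inputs; it runs on the CPU too, in which case the memory check is skipped.

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(8, 2).to(device)     # hypothetical model
inputs = torch.randn(4, 8).to(device)  # Tensor.to returns a new tensor

# Every parameter should report the same device type as `device`.
param_devices = {p.device.type for p in model.parameters()}
print("parameter devices:", param_devices)
print("input device:", inputs.device)

if torch.cuda.is_available():
    # Memory currently held by tensors on the default CUDA device.
    print(f"allocated: {torch.cuda.memory_allocated() / 1e6:.2f} MB")
```

Note that `Tensor.to(device)` is not in-place, so forgetting to assign the result leaves the data on the CPU, whereas `Module.to(device)` moves the module's parameters in place.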
Other profiling tools like the PyTorch profiler will help you find the root cause of your issue more quickly: https://github.com/pytorch/kineto/tree/main/tb_plugin
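For a quick look without TensorBoard, something like the following sketch with `torch.profiler` may be enough. The model and input sizes are placeholders I made up, and CUDA activity is only recorded when a GPU is available.

```python
import torch
from torch import nn
from torch.profiler import ProfilerActivity, profile

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(64, 64).to(device).eval()  # hypothetical small model
x = torch.randn(256, 64, device=device)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    with torch.inference_mode():
        for _ in range(5):
            model(x)

# Summarize the most expensive operators by total CPU time.
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(table)
```

If no CUDA kernels show up in the table while a GPU is present, that would suggest the model or inputs never left the CPU.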