I have a small model that uses very little of the GPU's resources. I am looking for the best method to run data through this model as quickly as possible. The order of the data does not matter, so no synchronization needs to occur.
I have looked into multiprocessing, but I have been unable to pass the model/tensors while they are already on the GPU (`.cuda()`), and moving them to the GPU every time seems to be a significant slowdown.
What is the best way to approach this problem so as to get as much out of a single GPU as possible?
A high batch size should work. Batch size parallelizes compute so as to maximize GPU core usage. You could think of a batch size of 1,000 as running 1,000 models side by side to compute the results. Granted, that is an imperfect comparison, because once you have maximized core usage, a higher batch size won't reduce total job time any further.
Increasing batch size is indeed the easiest way to saturate a GPU. GPU kernel launches are asynchronous, so you shouldn't need to parallelize job submission yourself.
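A minimal sketch of what this looks like, assuming a toy `nn.Sequential` model and random data standing in for your real ones: load the data through a `DataLoader` with a large `batch_size`, move each batch to the GPU, and run inference under `torch.no_grad()`. The model, shapes, and batch size here are all placeholders to tune against your own workload.

```python
import torch
import torch.nn as nn

# Hypothetical small model; substitute your own.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device).eval()

# Fake dataset standing in for your real data.
data = torch.randn(10_000, 64)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(data),
    batch_size=1024,  # raise this until GPU utilization saturates
    pin_memory=torch.cuda.is_available(),  # faster host-to-device copies
)

outputs = []
with torch.no_grad():  # inference only: skip autograd bookkeeping
    for (batch,) in loader:
        batch = batch.to(device, non_blocking=True)
        outputs.append(model(batch).cpu())

result = torch.cat(outputs)
print(result.shape)  # torch.Size([10000, 10])
```

Since order doesn't matter for you, there is no need to track indices; just keep increasing `batch_size` until utilization (or memory) tops out.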
Also, if your utilization is very low, it could be that you're not actually using your GPU. Check with nvidia-smi that GPU memory allocation changes to account for the model size, and make sure both the model weights and the input data are stored on the GPU.