How to make the best use of a single GPU with a small model

I have a small model that uses very little of the GPU's resources, and I am looking for the best way to push data through it as quickly as possible. The order of the data does not matter, so no synchronization needs to occur.

I have looked into multiprocessing, but I have been unable to pass the model/tensors to worker processes once they have been moved to the GPU with .cuda(), and converting them back and forth every time seems to be a significant slowdown.

What is the best way to approach this problem and get as much as possible out of a single GPU?

A high batch size should work. Batching parallelizes the compute so as to maximize GPU core utilization; you can think of a batch size of 1,000 as running 1,000 copies of the model side by side to compute the results. Granted, the analogy is imperfect: once you have saturated the cores, raising the batch size further will no longer reduce total job time.
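A minimal sketch of this approach, assuming PyTorch: the model and dataset below (`nn.Sequential`, random input tensors, batch size 1,000) are illustrative stand-ins, not taken from the question. The key points are moving the model to the GPU once, feeding large batches, and disabling autograd for inference.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Move the model to the GPU once, up front (falls back to CPU if unavailable).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).to(device)
model.eval()

# Stand-in for the real inputs: 10,000 feature vectors of size 16.
data = torch.randn(10_000, 16)
loader = DataLoader(
    TensorDataset(data),
    batch_size=1_000,                        # large batches keep the GPU busy
    pin_memory=(device.type == "cuda"),      # enables faster async host-to-device copies
)

outputs = []
with torch.no_grad():                        # skip autograd bookkeeping for inference
    for (batch,) in loader:
        batch = batch.to(device, non_blocking=True)
        outputs.append(model(batch).cpu())

result = torch.cat(outputs)                  # shape: (10000, 4)
```

If the batch size is large enough to saturate the GPU, the data-loading and host-to-device copies usually become the bottleneck, which is what `pin_memory` and `non_blocking=True` help with.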