I have a small model that uses very little of the GPU's resources, and I am looking for the best way to push data through it as quickly as possible. The order of the data does not matter, so no synchronization needs to occur.
I have looked into multiprocessing, but I have been unable to pass the model/tensors to the workers as `.cuda()`, and converting them to CUDA in each worker every time seems to be a significant slowdown.
What is the best way to approach this problem so I get as much as possible out of a single GPU?