I have an EncoderDecoder architecture that uses a custom RNN in the decoder, and consequently the GPU is not fully utilized during the decode phase when a single instance of the model is used (target platform is V100). The model is exported to TorchScript to optimize away the Python overhead in the custom RNN.
My current workaround is to spawn multiple workers that handle requests concurrently and hence utilize the GPU better (this was also confirmed when measuring real-world throughput). The problem is that each worker hosts not only this model but several others, and every worker keeps its own copy of each one. This setup sometimes causes memory problems, since I end up holding several copies of every model in GPU memory.
- Since Python threads can’t run in parallel because of the GIL, I was wondering: is it possible to load the GPU model(s) in C++ in the main thread, spawn several worker threads, and share the model between those threads so each thread handles its own workload independently?
- If the answer to the first question is yes, are there any blockers to performance and utilization improvements (e.g., kernels from the same CUDA context cannot overlap)?
EDIT: I should have added more context on why the GPU is not fully utilized during inference, so I’m updating the post. The problem is not validation inference, where I can control the batch size and fully utilize the GPU(s). This is all about the real-world scenario, where I’m processing a request and decoding a single sequence at a time, maybe a few more if dynamic batching succeeds; but the problem being solved by the network is not easy to batch (both input and output have dynamic axes).