Most efficient way to run a model on the GPU while optimizing its input

You could apply CPU offloading as described here, or reduce the number of iterations if possible.
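The original link is not available, so the sketch below is only a minimal illustration. It assumes PyTorch and uses one form of CPU offloading: `torch.autograd.graph.save_on_cpu`, which moves the tensors saved for backward into host memory during the forward pass and copies them back during backward. It also stops early once the loss stops changing. The helper `optimize_input`, the `tol` threshold, and the toy model are hypothetical placeholders, not part of the original answer.

```python
import torch
import torch.nn as nn


def optimize_input(model, x_init, target, max_steps=100, lr=0.1, tol=1e-6):
    """Hypothetical helper: optimize an input tensor against a frozen model."""
    device = next(model.parameters()).device
    # Freeze the model (note: this mutates the caller's model); only x is trained.
    for p in model.parameters():
        p.requires_grad_(False)
    model.eval()

    x = x_init.clone().to(device).requires_grad_(True)
    target = target.to(device)
    optimizer = torch.optim.Adam([x], lr=lr)
    loss_fn = nn.MSELoss()
    pin = torch.cuda.is_available()  # pinned host memory only helps with CUDA

    prev_loss = float("inf")
    loss_value = float("inf")
    steps_run = 0
    for _ in range(max_steps):
        optimizer.zero_grad(set_to_none=True)
        # Activations saved for backward are packed to CPU memory here
        # and unpacked back to the device when backward() runs.
        with torch.autograd.graph.save_on_cpu(pin_memory=pin):
            loss = loss_fn(model(x), target)
        loss.backward()
        optimizer.step()

        loss_value = loss.item()
        steps_run += 1
        # Hypothetical early-stopping rule to cut down the number of iterations.
        if abs(prev_loss - loss_value) < tol:
            break
        prev_loss = loss_value
    return x.detach(), loss_value, steps_run


# Usage with a toy model (assumed for illustration only).
torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 4)).to(device)
x0 = torch.randn(1, 8)
target = torch.zeros(1, 4)

with torch.no_grad():
    init_loss = nn.MSELoss()(model(x0.to(device)), target.to(device)).item()

x_opt, final_loss, n_steps = optimize_input(model, x0, target, max_steps=200)
```

Offloading saved tensors trades GPU memory for host-to-device transfer time, so it pays off mainly when activations, not compute, are the bottleneck; lowering the iteration count attacks runtime directly.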