Py parallel execution of a model with many/few features

Since I have unused GPU memory to fit during serial forecasting of several k model and features.
I’d like to receive your help to implement smart a way of fit such memory (and speed up/ shorten the whole model swipe) by parallelization of picking up a bunch of pre-learned models specific for each feature, pick-up a bunch of different independent features at time and feed concurrently the Cuda GPU with such models to obtain the forecast results.
I’m specifically referring to multivariate forecast time series.
I’ve tried implement both threads,(but seams GIL is incompatible with GPU) and multiprocess but always without success.