Multiprocessing CUDA memory

The evaluator could run on the CPU, but with a big performance hit: model 1 is a big convolutional VAE and model 2 is a big LSTM, so inference time gets pretty long on CPU, especially since the same process also has to emulate the environment!
Is it required to re-initialize CUDA entirely in every new process, even if we just want to run inference on memory that was already allocated by a process that has itself already initialized CUDA?
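
Concretely, here is a minimal sketch of the situation I mean (illustrative names; in my real code the workers receive model activations rather than a random tensor):

```python
import torch
import torch.multiprocessing as mp

def worker(gpu_tensor, out_queue):
    # The tensor was already allocated on the GPU by the parent, but as far
    # as I can tell this child process still creates its own CUDA context
    # (a few hundred MiB of device memory) just to touch it.
    out_queue.put(gpu_tensor.sum().item())

if __name__ == "__main__":
    mp.set_start_method("spawn")  # required when workers touch CUDA tensors
    parent_tensor = torch.randn(1024, 1024, device="cuda")
    q = mp.Queue()
    p = mp.Process(target=worker, args=(parent_tensor, q))
    p.start()
    print(q.get())  # works, but at the cost of a full context in the child
    p.join()
```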
If so, are there any tips to reduce the memory footprint of that initialization (and memory usage in general, perhaps)?
I am currently at 414 MiB per new process.
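
To frame what I mean by reducing the footprint: the only knob I have come across so far is CUDA's lazy module loading. This is just a sketch, assuming CUDA >= 11.7, and I have not confirmed how much it actually saves:

```python
import os
# Must be set before CUDA is initialized (i.e. before the first CUDA call),
# and only has an effect with CUDA >= 11.7.
os.environ["CUDA_MODULE_LOADING"] = "LAZY"

import torch
torch.cuda.init()  # force context creation so the footprint shows up in nvidia-smi
```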
Thanks!