Running multiple training processes of the same network

I want to build a deep ensemble of my network and have a question about how to run several training runs in parallel.
With a batch size of 16, my network takes about 3 minutes per epoch and uses 2 GB of GPU memory.

Since there is enough GPU memory, I expected that launching 4 processes would also take about 3 minutes, but it took 12 minutes.

Clearly, I am doing something wrong, or I don’t understand how parallelism works on the GPU …

Could anyone help me in this matter?

The GPU’s compute resources might already be fully utilized even though its memory isn’t (similar to your PC, where the CPU can be at 100% while most of the RAM is free).
PyTorch kernels are usually written to saturate the GPU, so running multiple models truly in parallel might not be possible (you would also have to use CUDA streams for it, which can be tricky).
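To illustrate the process-level setup, here is a minimal sketch of launching the ensemble members as separate processes. `train_member` is a hypothetical placeholder for the real training loop (not from the original post); the point is that spawning N processes does not guarantee an N-fold speedup when each run already saturates the GPU’s compute units:

```python
# Hypothetical sketch: launch independent ensemble trainings as separate
# processes. `train_member` stands in for a real PyTorch training loop.
import multiprocessing as mp

def train_member(seed):
    # Placeholder for a full training run; a real version would build the
    # model, move it to the GPU, seed the RNG, and train for N epochs.
    return {"seed": seed, "final_loss": 1.0 / (seed + 1)}

def train_ensemble(n_members=4):
    # Each member runs in its own process. If a single run already keeps
    # the GPU's compute units busy, the processes get serialized on the
    # device, so wall-clock time approaches n_members * single_run_time
    # (matching the 4 x 3 min = 12 min observed above).
    with mp.Pool(processes=n_members) as pool:
        return pool.map(train_member, range(n_members))

if __name__ == "__main__":
    results = train_ensemble(4)
    print(len(results))
```

In practice, checking the GPU utilization (e.g. with `nvidia-smi`) while a single training run is active tells you whether there is any compute headroom left for additional processes.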