How to avoid copying data when using multiprocessing and CUDA?

I want to train multiple independent models on the GPU using multiprocessing. The data is the same for all the processes. The fork start method relies on copy-on-write, so as long as the data is read-only (as mine is), the child processes do not make their own copy of it. From the multiprocessing documentation, I see that CUDA only supports the spawn and forkserver start methods. With these methods, however, each child process makes a separate copy of the data, and memory usage grows linearly with the number of processes.

How can I have common data used by all the child workers while training on the GPU? There is no communication between the processes and the data is read-only. If I don’t use the GPU, the fork method works fine, but obviously I’m interested in making it work on the GPU.
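
To make the setup concrete, here is a minimal sketch of the kind of sharing being asked about. It uses only the Python standard library (`multiprocessing.shared_memory`) with the spawn start method. The helper names `create_shared`, `read_sum` and `worker` are hypothetical, the "training" is replaced by a simple sum, and nothing here touches CUDA. Whether the same pattern carries over cleanly to tensors used for GPU training is an assumption, not something this sketch verifies.

```python
import multiprocessing as mp
from array import array
from multiprocessing import shared_memory


def create_shared(values):
    """Copy a list of floats once into a named shared-memory block (hypothetical helper)."""
    buf = array("d", values)
    nbytes = len(buf) * buf.itemsize
    shm = shared_memory.SharedMemory(create=True, size=max(1, nbytes))
    shm.buf[:nbytes] = buf.tobytes()
    return shm


def read_sum(name, count):
    """Attach to an existing block by name and sum its first `count` doubles."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        # View the shared bytes as doubles without copying them.
        with shm.buf[: count * 8].cast("d") as view:
            return sum(view)
    finally:
        shm.close()


def worker(name, count, model_id):
    # Stand-in for "train one independent model on the shared, read-only data".
    print(f"model {model_id}: sum of shared data = {read_sum(name, count)}")


def main():
    ctx = mp.get_context("spawn")  # spawn, since fork is not an option with CUDA
    count = 1000
    shm = create_shared([float(i) for i in range(count)])
    try:
        # Only the block's name and length are sent to the children, not the data.
        procs = [ctx.Process(target=worker, args=(shm.name, count, i)) for i in range(2)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
    finally:
        shm.close()
        shm.unlink()  # free the block once all workers are done


if __name__ == "__main__":
    main()
```

The point of the sketch is that each spawned process attaches to the same block by name instead of receiving a pickled copy, so host memory should not grow with the number of workers. It covers CPU-side data only.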

I haven’t used nn.parallel.DistributedDataParallel, but I believe it is more suitable for training a single model and distributing the batches across different devices. My use case is different: I am training separate models on one or more GPUs, and the training data does not need to be distributed. Is it possible to use nn.parallel.DistributedDataParallel here?