PyTorch multiprocessing runs much slower and fails with out of memory

Hi, I'm new to parallel programming.
Recently, I have been trying to parallelize my code.
Because of Python's global interpreter lock, I chose process-based parallelism.
I first tried Python's native multiprocessing library, but it raised an error when importing torch.
So I tried PyTorch's multiprocessing, which also failed with some weird errors.
For example, it raised CUDA: Out of Memory, which shouldn't happen.
It occupies a huge amount of memory, and I don't understand why.
I also find the logic of multiprocessing confusing.
By the way, when the number of subprocesses is low, it no longer fails.
Another tough problem is that the parallelized code is much slower.
Here's a comparison (in seconds):
8.354568481445312 # parallelized
0.00092315673828125 # not parallelized
When I debug, it seems that each subprocess doesn't just run the function; it runs the script from the start.
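
That behavior matches how the spawn start method works: each worker starts a fresh Python interpreter and re-imports the main script, so any code outside the `if __name__ == "__main__":` guard runs again in every worker. With CUDA, each process that touches the GPU also creates its own CUDA context, which may account for part of the memory use. Below is a minimal sketch using only the standard library; the script contents and the names `work` and `run_demo` are made up for illustration.

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical demo script: its module-level print runs once in the parent
# and once more in every spawned worker that re-imports the script.
SCRIPT = textwrap.dedent('''
    import multiprocessing as mp

    print("module-level code executed", flush=True)

    def work(x):
        return x * 2

    if __name__ == "__main__":
        mp.set_start_method("spawn")
        with mp.Pool(2) as pool:
            print("results", pool.map(work, range(4)), flush=True)
''')


def run_demo():
    """Write the demo script to a temp file, run it, and return its output lines."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "spawn_demo.py"
        path.write_text(SCRIPT)
        proc = subprocess.run(
            [sys.executable, str(path)],
            capture_output=True, text=True, check=True,
        )
    return proc.stdout.splitlines()


if __name__ == "__main__":
    for line in run_demo():
        print(line)
```

The "module-level code executed" line should appear more than once in the output, while the code under the guard runs only in the parent.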

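from multiprocessing import freeze_support
from time import time

import torch
from torch import multiprocessing as mp


def f(x):
    # Multiply the slice in place by random noise generated on the GPU.
    x *= torch.randn((3, 24, 24), device='cuda:0')


if __name__ == "__main__":
    freeze_support()  # for Windows support
    data = torch.randn(16, 3, 24, 24, device='cuda:0')
    mp.set_start_method('spawn')  # the failure log prompted me to run this
    pool = mp.Pool(mp.cpu_count())
    start = time()
    for i in data:
        pool.apply(f, (i,))  # loop body missing in the post; assumed call
    pool.close()
    pool.join()
    print(time() - start)  # parallelized
    start = time()
    for i in data:
        f(i)  # loop body missing in the post; assumed call
    print(time() - start)  # not parallelized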
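
A note on the timings: CUDA kernels are launched asynchronously, so a wall-clock measurement taken without synchronizing may mostly capture launch overhead, not the actual GPU work. The sketch below shows one way to time the sequential loop more fairly. The helper name `timed` is hypothetical, and the CPU fallback is only there so the example runs without a GPU.

```python
from time import perf_counter

import torch

# Fall back to the CPU so the sketch also runs without a GPU (assumption).
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


def f(x):
    # Same in-place scaling as in the post, on whichever device x lives on.
    x *= torch.randn((3, 24, 24), device=x.device)


def timed(fn, *args):
    """Return the wall-clock seconds fn(*args) takes, including queued GPU work."""
    if device.type == "cuda":
        torch.cuda.synchronize(device)  # finish earlier kernels first
    start = perf_counter()
    fn(*args)
    if device.type == "cuda":
        torch.cuda.synchronize(device)  # wait until fn's kernels are done
    return perf_counter() - start


data = torch.randn(16, 3, 24, 24, device=device)
timed(f, data[0])  # warm-up: the first CUDA call pays one-time setup costs
elapsed = timed(lambda: [f(x) for x in data])
print(f"sequential: {elapsed:.6f} s")
```

Even with synchronization, the sequential loop will probably stay far ahead of the pool here, because 16 tiny tensors give the workers very little work relative to their startup cost.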

Full Error Log
My Running Environment:
torch 1.10.0+cu113
Thank you!

This kind of parallelization strategy is somewhat unusual as many PyTorch models can be expressed without explicitly relying on multiprocessing/pools (e.g., via parallel dataloaders, distributed data-parallel, etc.). Additionally, in a single GPU setup (as this appears to be), parallelizing CPU processes but ultimately dispatching to the same GPU may increase contention and not actually yield any speedup.

Could you share some more details about the workload/use case that you are trying to speed up?

OK, here's a slice of my original code:

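        for key_id, key in enumerate(self.keys):
            # Sample a perturbation matrix on the CPU with this layer's shape.
            pmf_matrix = np.random.choice(
                self.PerturbationTable, self.shapes[key_id], p=self.prop_table)
            # Copy it to the device and scale the layer's weights element-wise.
            dic[key] *= torch.tensor(pmf_matrix, device=self.device)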

As you can see, for every layer in a CNN, I multiply its weights element-wise by a matrix.
So far, it runs relatively slowly: about 4 iterations per second with ResNet-18 on an RTX 3090,
and 2.5 iterations per second on my laptop (RTX 3050).
I hope it can be faster.
That is why I want to speed it up with parallelism.
Is my idea for speeding it up wrong?
Can this code not be sped up any further when I only have one graphics card?
Besides that, print in my subprocess doesn't print anything.

My other idea is to let subprocess 1 use the GPU and the other subprocesses use the CPU.
But it seems that launching a subprocess is very expensive.
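
One option that avoids extra processes altogether might be to draw the perturbation directly on the GPU, which skips the NumPy sampling on the CPU and the host-to-device copy for every layer. The sketch below assumes `self.PerturbationTable` is a 1-D array of values and `self.prop_table` holds the matching probabilities, as the `np.random.choice` call suggests. The function name `sample_perturbation` and the example values are hypothetical.

```python
import math

import torch


def sample_perturbation(table, probs, shape):
    """Draw a tensor of `shape` whose entries come from `table` with probabilities `probs`."""
    n = math.prod(shape)
    # Sample n indices into the table, with replacement, on the table's device.
    idx = torch.multinomial(probs, n, replacement=True)
    return table[idx].reshape(shape)


# Fall back to the CPU so the sketch also runs without a GPU (assumption).
device = "cuda" if torch.cuda.is_available() else "cpu"

# Made-up perturbation values and probabilities, created once on the device.
table = torch.tensor([0.5, 1.0, 2.0], device=device)
probs = torch.tensor([0.25, 0.5, 0.25], device=device)

# Stand-in for one convolution layer's weights.
weight = torch.ones(64, 3, 7, 7, device=device)
weight *= sample_perturbation(table, probs, weight.shape)
```

Whether this actually helps depends on how much of each iteration is spent on sampling and copying, so it is worth profiling before and after.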