Hello,
I have a for loop which makes independent calls to a certain function. The calls should be processed in parallel, as they are completely independent. I have the following code which works for CPU. For GPU I am still trying to get it working.
import multiprocessing
from joblib import Parallel, delayed
class some_function(Function):
@staticmethod
def forward(ctx, x):
pass # here goes the code of the forward pass
@staticmethod
def backward(ctx, grad_output):
pass
class L_Field(nn.Module):
def __init__(self):
super(L_Field, self).__init__()
def forward(self, x):
batch_size, channels, height, width = x.shape
num_cores = multiprocessing.cpu_count()
results = Parallel(n_jobs=num_cores, backend="threading")(delayed(some_function.apply)(x[b, c]) for b in range(batch_size) for c in range(channels))
out = torch.stack(results).reshape(batch_size, channels, height, width)
return out
For the GPU, I tried the following:
import torch.multiprocessing as mp
class some_function(Function):
@staticmethod
def forward(ctx, x):
pass # here goes the code of the forward pass
@staticmethod
def backward(ctx, grad_output):
pass
class L_Field(nn.Module):
def __init__(self):
super(L_Field, self).__init__()
def forward(self, x):
batch_size, channels, height, width = x.shape
num_cores = mp.cpu_count()
x.share_memory_()
pool = mp.Pool(processes=num_cores)
results = [pool.apply(some_function.apply, args=(x[b, c],)) for b in range(batch_size) for c in range(channels)]
pool.close()
pool.join()
out = torch.stack(results).reshape(batch_size, channels, height, width)
return out
For which I get:
RuntimeError: Cowardly refusing to serialize non-leaf tensor which requires_grad, since autograd does not support crossing process boundaries. If you just want to transfer the data, call detach() on the tensor before serializing (e.g., putting it on the queue).
If I do the detach:
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
Then, if I do mp.set_start_method('spawn', force=True)
:
RuntimeError: CUDA error: out of memory
Any ideas will be appreciated.