Allocate a CUDA tensor in a subprocess

Here’s another torch.multiprocessing situation I can’t wrap my head around. This example works:

import torch
import torch.multiprocessing as mp

def put_in_q():
    x = torch.IntTensor(2, 2).fill_(22)
    x = x.cuda()
    print(x)

p = mp.Process(target=put_in_q, args=())
p.start()
p.join()

But not this one:

import torch
import multiprocessing as mp

class CudaProcess(mp.Process):
    def __init__(self):
        mp.Process.__init__(self)

    def run(self):
        x = torch.IntTensor(2, 2).fill_(22)
        x = x.cuda()
        print(x)

p = CudaProcess()
p.start()
p.join()

The error I’m getting is:

terminate called after throwing an instance of 'THException'
  what():  cuda runtime error (3) : initialization error at .../pytorch/torch/lib/THC/THCGeneral.c:70

CUDA multiprocessing is quite complicated, and I wouldn’t recommend it unless it gives you huge performance benefits. The two most important things to remember are:

  1. You can’t use the default fork start method. You need to switch multiprocessing to either spawn (simpler) or forkserver (possibly more efficient, but still subtle: the server needs to be started before CUDA is initialized). The problem is that CUDA doesn’t support forking a process once the context has been initialized; the child ends up in an inconsistent state and it’s UB (see the sketch after this list).
  2. A shared CUDA storage/tensor mustn’t go out of scope in the process that created it for at least as long as it’s used in other processes. Otherwise, UB again.
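
For example, here is a minimal sketch of how the failing snippet above could be adapted, assuming spawn is acceptable and that the start method is set before anything touches CUDA:

import torch
import multiprocessing as mp

class CudaProcess(mp.Process):
    def run(self):
        # CUDA is initialized here, inside the child process only
        x = torch.IntTensor(2, 2).fill_(22)
        x = x.cuda()
        print(x)

if __name__ == '__main__':
    # must happen before any CUDA call in the parent
    mp.set_start_method('spawn')
    p = CudaProcess()
    p.start()
    p.join()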

I’m not familiar with all of the architecture in PyTorch, but I would like to share some thoughts if I may. Wouldn’t multi-threaded multi-GPU work if:

  1. the C++ parts released the GIL systematically
  2. memory (de)allocation weren’t blocking (e.g. a memory pool like TensorFlow’s instead of the caching allocator)

I guess moving all the C++ parts to cffi would solve 1) and would even make it possible to use PyPy (Lua Torch used a JIT, but in Python it’s suddenly a bad idea?)


Thank you for the quick reply. For some context, I am trying to launch several agents inheriting from mp.Process, each playing in a separate environment and enqueuing frames into a shared Queue. A single prediction subprocess dequeues batches of frames, runs an inference step through the neural network estimator, and passes actions back to each agent.
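
Roughly, the setup looks like this (a simplified sketch; the environment helpers, queues and names here are just placeholders, not the real code):

import torch
import torch.multiprocessing as mp

class AgentProcess(mp.Process):
    def __init__(self, env, frame_queue, action_queue):
        super(AgentProcess, self).__init__()
        self.env = env                      # placeholder environment object
        self.frame_queue = frame_queue
        self.action_queue = action_queue

    def run(self):
        frame = self.env.reset()
        while True:
            self.frame_queue.put((self.pid, frame))  # CPU tensor into the shared queue
            action = self.action_queue.get()         # wait for the predictor's answer
            frame = self.env.step(action)

def predictor(model, frame_queue, action_queues, batch_size):
    while True:
        pids, frames = zip(*[frame_queue.get() for _ in range(batch_size)])
        batch = torch.stack(list(frames))            # batch assembled on the CPU
        actions = model(batch)                       # single inference step
        for pid, action in zip(pids, actions):
            action_queues[pid].put(action)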

I am already seeing nice speed-ups compared to the naive asynchronous implementation, and after I changed the start method to spawn it seemed to run fine on my laptop with CUDA support (shaving about 20% off the execution time when using CUDA). However, when I tested it on our servers I got a RuntimeError: CUDNN_STATUS_NOT_INITIALIZED.

I see now that setting this up is not trivial, so I will continue working on the implementation while leaving CUDA aside for now. I do plan to revisit the issue and I’ll post here about any developments.

@apaszke also, how is multi-GPU support being implemented? I’ve seen the Hogwild! example, but it’s CPU-only.

edit:

The child ends up in an inconsistent state and it’s UB

What is UB? :slight_smile:

UB=undefined behaviour I guess


Yeah, UB is undefined behaviour = anything could happen now.

@kmichaelkills we do release the GIL in the C++ parts, and memory (de)allocation isn’t blocking. You can think of the caching allocator as a self-expanding memory pool.

Multi-threaded multi-GPU works, and is what our DataParallel module and data_parallel function use to dispatch kernels. Still, even though we carefully release the GIL as soon as we enter C++, the contention on the lock, which serializes execution, makes it nearly impossible to saturate 8 modern GPUs. That’s why we’ve started exploring CUDA multiprocessing.
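
For the threaded path, the usual entry point looks something like this (a sketch assuming two visible GPUs):

import torch
import torch.nn as nn
from torch.autograd import Variable

model = nn.Linear(10, 5).cuda()
# DataParallel scatters the input batch across the listed devices,
# replicates the module, and runs the forward passes from multiple threads
parallel_model = nn.DataParallel(model, device_ids=[0, 1])

x = Variable(torch.randn(32, 10).cuda())
y = parallel_model(x)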

Regarding cffi, we didn’t want to add more dependencies, and having more fine-grained control over the GIL and some other aspects makes it easier for us to develop the C code. I’d love to support PyPy, but there are very few people using it, so it’s not really a top priority for us.

@florin well, the frames are on the CPU already, right? Can’t you cat them there and ship the whole batch to CUDA before giving it to the network? Why do you need CUDA multiprocessing there?
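
Something along these lines (a sketch; frames is assumed to be a list of CPU tensors pulled from the queue, and net is your estimator):

import torch

# frames: list of CPU tensors of shape (C, H, W) from the shared queue
batch = torch.cat([f.unsqueeze(0) for f in frames])  # concatenate on the CPU
batch = batch.cuda()                                 # one host-to-device copy per batch
actions = net(batch)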

I don’t really understand the thing about multi-GPU. You can switch between GPUs using torch.cuda.device, and that’s all you need to execute code on another device. Alternatively, if you want to have purely asynchronous training, you could start up some threads or processes, have them run on separate GPUs, and sync parameters periodically. This is a use case where CUDA multiprocessing can come in handy, because the shared memory never goes out of scope.
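
For example (a sketch; the device indices are assumptions):

import torch

x = torch.FloatTensor(4, 4).cuda()        # lives on the current device (GPU 0 by default)

with torch.cuda.device(1):
    y = torch.FloatTensor(4, 4).cuda()    # allocated on GPU 1 while the context is active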
