Invalid device pointer using multiprocessing with CUDA

The following multiprocessing code works fine with DEVICE=cpu, but seems to fail with CUDA:

import torch
import torch.multiprocessing as mp

DEVICE = "cuda"


def proc(queue):
    # the worker puts a single CUDA tensor on the queue
    queue.put(torch.tensor([1.], device=DEVICE))


if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    queue = ctx.Manager().Queue()  # note: a Manager queue, not a plain ctx.Queue()
    p = ctx.Process(target=proc, args=(queue,))
    p.daemon = True
    p.start()
    print("received: ", queue.get())

throws an error:

Traceback (most recent call last):
  File "examples/ex2.py", line 17, in <module>
    print("received: ", queue.get())
  File "<string>", line 2, in get
  File "/home/npradhan/miniconda3/envs/pytorch-master/lib/python3.6/multiprocessing/managers.py", line 772, in _callmethod
    raise convert_to_error(kind, result)
multiprocessing.managers.RemoteError:
---------------------------------------------------------------------------
Unserializable message: Traceback (most recent call last):
  File "/home/npradhan/miniconda3/envs/pytorch-master/lib/python3.6/multiprocessing/managers.py", line 283, in serve_client
    send(msg)
  File "/home/npradhan/miniconda3/envs/pytorch-master/lib/python3.6/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/npradhan/miniconda3/envs/pytorch-master/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/home/npradhan/miniconda3/envs/pytorch-master/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 161, in reduce_tensor
    (device, handle, storage_size, storage_offset) = storage._share_cuda_()
RuntimeError: invalid device pointer: 0x7f89d4200000 at /home/npradhan/workspace/pyro_dev/pytorch/aten/src/THC/THCCachingAllocator.cpp:262

---------------------------------------------------------------------------

I also noticed that the error thrown differs across slightly different variants of the above code. For example, using mp.Queue() instead of mp.Manager().Queue() seems to throw a cuda runtime error (30), and using the legacy FloatTensor constructor instead of torch.tensor throws RuntimeError: legacy constructor for device type: cpu was passed device type: cuda, but device type must be: cpu. I think my mental model might be incorrect, so any help understanding what is happening here would be appreciated.
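
Concretely, the FloatTensor variant can be reproduced without any multiprocessing at all – the error message itself says the legacy constructor only accepts the cpu device type:

import torch

torch.tensor([1.], device="cuda")        # fine
torch.cuda.FloatTensor([1.])             # fine – legacy CUDA constructor
torch.FloatTensor([1.], device="cuda")   # RuntimeError: legacy constructor for device type: cpu ...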

@Neeraj can you try to keep those Tensors alive on the workers until you access them on the main process, and see if that fixes it – at least for debugging?

Thanks for the suggestion, @smth. I am seeing the same exception get triggered when the main process tries to fetch via queue.get(), even though the worker can access the tensor fine.

@smth - Fun fact: if I make the worker sleep for 5 seconds, the following happens:

  • If we use mp.Manager().Queue(), the problem still persists as I reported earlier.
  • If we change to using mp.Queue(), then the problem is resolved (as you were expecting!) – a rough sketch of this variant follows below.
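
The mp.Queue() variant with the sleep would look roughly like this:

import time

import torch
import torch.multiprocessing as mp

DEVICE = "cuda"


def proc(queue):
    queue.put(torch.tensor([1.], device=DEVICE))
    time.sleep(5)  # keep the worker (and its CUDA allocation) alive for a while


if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()  # plain queue; with the sleep, the main process receives the tensor fine
    p = ctx.Process(target=proc, args=(queue,))
    p.daemon = True
    p.start()
    print("received: ", queue.get())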

The Python docs (and many other resources) seem to point out that mp.Manager().Queue() can be used wherever mp.Queue() is used, but it does seem that torch CUDA tensors somehow do not play well with mp.Manager().Queue(). I have found that Manager().Queue() is better for handling cleanup and preventing deadlock when one of the workers dies or is interrupted, and that is the reason why I was using it (it works great with CPU tensors). The other question was that even if we were to move to using mp.Queue(), I wouldn't be able to control workers retaining access to the transferred tensors, so I would be curious to know the best practices around that. I'm happy to open an issue instead if you think that's a better channel to discuss this.

mp.Manager().Queue() is very likely not overriding these functions that would use our ForkingPickler (which shares CUDA memory, etc.): https://github.com/pytorch/pytorch/blob/master/torch/multiprocessing/queue.py#L34-L38

That explains a lot of things.

The other question was that even if we were to move to using mp.Queue(), I wouldn't be able to control workers retaining access to the transferred tensors, so I would be curious to know the best practices around that.

You don't actually need to hold onto those Tensors; I was just hoping that it'd give some debugging insights.

Yes, I think that is indeed the issue - we are using multiprocessing.ForkingPickler and not the torch version. Let me see if I can try to fix this in a PR.

Does it point to some bug, in that case?

Does it point to some bug, in that case?

No, I think the bug was just that we didn't override / support mp.Manager().Queue(), that's all.

There are actually two issues here – one is that mp.Manager().Queue() behaves differently from mp.Queue() in that it throws an invalid device pointer (regardless of the fix below). As I debugged a bit more, it seems that it is using the correct ForkingPickler from torch.multiprocessing, so the reason why it fails is not obvious to me. :frowning_face:
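
(One quick way to sanity-check that the torch reducers are in play – this relies on _extra_reducers, a CPython multiprocessing internal, so it is purely a debugging aid and the registered types can vary across torch versions:)

import torch
import torch.multiprocessing  # importing this registers torch's custom reducers
from multiprocessing.reduction import ForkingPickler

# when torch's reducers are registered, ForkingPickler serializes tensors via
# torch.multiprocessing.reductions.reduce_tensor (the function in the traceback)
print(torch.Tensor in ForkingPickler._extra_reducers)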

The second issue, which arises when the worker finishes early, exists for mp.Queue() too, and I was able to use this suggestion from @colesbury to resolve it – i.e. using an mp.Event() to keep the worker process alive until all tensors have been fetched in the main process (rough sketch below).
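
The pattern would look roughly like this (the names are just illustrative):

import torch
import torch.multiprocessing as mp

DEVICE = "cuda"


def proc(queue, done):
    queue.put(torch.tensor([1.], device=DEVICE))
    done.wait()  # keep the producer alive until the consumer has fetched the tensor


if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()  # a plain queue, not Manager().Queue()
    done = ctx.Event()
    p = ctx.Process(target=proc, args=(queue, done))
    p.daemon = True
    p.start()
    t = queue.get()
    print("received: ", t)  # use the tensor before signalling the worker
    done.set()              # now the worker is free to exit
    p.join()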
