Torch multiprocessing and sharing tensors between processes

Hi Folks,

I’m having a strange issue that I have already spent more than two days troubleshooting :). I think it is related to this example:

x = queue.get()
x_clone = x.clone()
queue_2.put(x_clone)

So if tensor x is on the GPU, as far as I understand, you need to clone it before putting it on the second queue.
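
For context, the full pattern I’m following looks roughly like this. It is only a minimal sketch with made-up names (consumer, queue_in, queue_out, done); as far as I know, the spawn start method and keeping the producing process alive until the tensor has been read are both required for CUDA tensors:

import torch
import torch.multiprocessing as mp

def consumer(queue_in, queue_out, done):
    x = queue_in.get()       # CUDA tensor arrives via an IPC handle
    x_clone = x.clone()      # clone before re-sharing, as in the example above
    queue_out.put(x_clone)
    done.wait()              # keep this process alive until the clone has been read

if __name__ == "__main__":
    mp.set_start_method("spawn")   # CUDA reportedly needs spawn/forkserver
    queue_in, queue_out = mp.Queue(), mp.Queue()
    done = mp.Event()
    p = mp.Process(target=consumer, args=(queue_in, queue_out, done))
    p.start()
    t = torch.ones(4, device="cuda")
    queue_in.put(t)                # t has to stay alive while the consumer uses it
    print(queue_out.get())
    done.set()
    p.join()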

First question: if I have an object A that holds references to four tensors, what is the right way to clone the entire object with all of its tensors? I.e., do you first need to move the tensors to the CPU, put the object on the queue, and then move them back to the GPU on the consumer side? Because if I detach().clone() while the tensors are on the GPU, the received data is always zeros.

class A:
    def __init__(self):
        self._observations = None  # tensor
        self._actions = None       # tensor

    def clone(self, dst):
        # detach and move to the CPU so the object can be pickled onto the queue
        dst._observations = self._observations.detach().cpu()
        dst._actions = self._actions.detach().cpu()
        return dst

Then I would use:

with network_lock:
    x = A()
    network.forward(x)
    new_x = A()
    # x.clone(new_x) fills new_x with detached copies of the tensors moved to the CPU.
    # Note that I’ve tried detach() and clone() in different combinations but can’t get it to work.
    # NOTE: when the device is set to CPU, everything works.
    queue.put(x.clone(new_x))
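
On the consumer side I would then do roughly this (just a sketch; the variable names are made up):

a = queue.get()                            # A whose tensors are now on the CPU
observations = a._observations.to("cuda")  # move back to the GPU for further work
actions = a._actions.to("cuda")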

My second question: if the tensors live on the GPU, what is the right way to put an object A that holds a bunch of GPU tensors on a queue? The clone method then becomes:

    def clone(self, dst):
        # keep the tensors on the GPU, just detach and clone them
        dst._observations = self._observations.detach().clone()
        dst._actions = self._actions.detach().clone()
        return dst
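
And on the consumer side I do roughly this (again just a sketch):

a = queue.get()
observations = a._observations.clone()  # clone right away, following the pattern from the example above
actions = a._actions.clone()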

In the second example, I notice that as soon as the object goes onto a queue, the data of any GPU tensor becomes zero: the values are all zeros when the consumer picks the object up from the queue.

I also see CUDA warnings and an error related to grad:

[W CUDAGuardImpl.h:62] Warning: CUDA warning: invalid device ordinal (function uncheckedSetDevice)
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)

I note that with small data I see no errors.

In the producer I don’t compute gradients, and I have tried different combinations.
I checked all the tensors that I serialize to the queue, and none of them has a grad attached.
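
Roughly, this is how I verify that before putting anything on the queue (illustrative only):

for name, t in [("observations", x._observations), ("actions", x._actions)]:
    assert not t.requires_grad, name
    assert t.grad_fn is None, name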

Thank you.