I am trying to send a CUDA tensor from a child process back to the main process, and below is a minimal example:
import torch

def process_1(pipe: torch.multiprocessing.Queue, event: torch.multiprocessing.Event):
    a = torch.rand([100, 100], device='cuda', dtype=torch.float)
    print(f'tensor sent: {a}')
    print(torch.sum(a))
    pipe.put(a)
    event.wait()
    event.clear()
    print(f'tensor after del: {a}')

if __name__ == '__main__':
    pipe = torch.multiprocessing.Queue()
    event = torch.multiprocessing.Event()
    p = torch.multiprocessing.Process(target=process_1, args=(pipe, event))
    p.start()
    recv = pipe.get()
    print(f'tensor received: {recv}')
    print(torch.sum(recv))
    del recv
    event.set()
I ran this code on Windows and found that CUDA tensors sent through shared memory (Pipe, Queue, etc.) arrive as all-zero tensors:
tensor sent: tensor([[0.0738, 0.4750, 0.2477, ..., 0.6408, 0.8999, 0.3425],
[0.9697, 0.0946, 0.6554, ..., 0.6812, 0.9557, 0.5535],
[0.0681, 0.4022, 0.7647, ..., 0.1023, 0.1328, 0.1847],
...,
[0.8344, 0.9620, 0.5390, ..., 0.2282, 0.6173, 0.3060],
[0.1959, 0.5154, 0.5861, ..., 0.3451, 0.1385, 0.2135],
[0.3778, 0.0317, 0.0770, ..., 0.6761, 0.7165, 0.1330]],
device='cuda:0')
tensor(4990.6367, device='cuda:0')
tensor received: tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], device='cuda:0')
tensor(0., device='cuda:0')
tensor after del: tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], device='cuda:0')
I suspect this is a Windows-specific issue. Is there any workaround, or do I have to switch to PyTorch on Linux?
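For what it's worth, one workaround I would expect to avoid the problem (an untested assumption on my part, not something I've verified on Windows) is to copy the tensor to host memory before putting it on the Queue, so the transfer goes through ordinary pickling instead of CUDA shared memory, and then move it back to the GPU in the receiver:

```python
# Sketch of the CPU round-trip workaround. `worker` is a hypothetical
# stand-in for process_1 above; device selection falls back to CPU so
# the sketch also runs on machines without CUDA.
import torch
import torch.multiprocessing as mp

def worker(queue: mp.Queue):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    a = torch.rand([100, 100], device=device, dtype=torch.float)
    # .cpu() copies the data into host memory, so the Queue only has to
    # pickle an ordinary CPU tensor -- no CUDA memory sharing involved.
    queue.put(a.cpu())

if __name__ == '__main__':
    queue = mp.Queue()
    p = mp.Process(target=worker, args=(queue,))
    p.start()
    recv = queue.get()  # plain CPU tensor
    p.join()
    if torch.cuda.is_available():
        recv = recv.cuda()  # copy back onto the GPU in this process
    print(torch.sum(recv))
```

This pays for an extra device-to-host and host-to-device copy per tensor, so it's only a stopgap, not a replacement for real CUDA IPC.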