PyTorch multiprocessing not sending CUDA tensors properly on Windows

Hi, I’m debugging why my project is unable to properly send/receive CUDA tensors on Windows. I tried to create a minimal repro, but in the minimal repro I ended up hitting a different bug. In my main project, the first batch that the producer process sends to the consumer is correct and the rest are corrupted (all 0s). In the simple script below, it’s the opposite: the first item read from the queue is corrupted, and the rest are fine:

import time
import torch

def test_producer_func(queue: torch.multiprocessing.SimpleQueue):
    torch.set_default_device("cuda")
    torch.cuda.init()
    while True:
        dummy_t = torch.ones(1, 30)
        # Make sure the tensor is fully written before it goes on the queue
        torch.cuda.synchronize()
        print(f"===== TEST PROCESS PUT TENSOR with num ones: {torch.sum(dummy_t == 1)} tensor: {dummy_t}")
        queue.put(dummy_t)
        time.sleep(3)


if __name__ == '__main__':
    torch.set_default_device("cuda")
    torch.cuda.init()
    queue = torch.multiprocessing.SimpleQueue()
    producer_process = torch.multiprocessing.Process(target=test_producer_func, args=(queue,))
    producer_process.start()
    # Poll the queue from the main (consumer) process
    while True:
        if not queue.empty():
            item = queue.get()
            print(f"-------->>> MAIN PROCESS GOT TENSOR num ones::: {torch.sum(item == 1)} item: {item}")
        else:
            time.sleep(0.001)

For some reason, the first tensor the main process reads is always corrupted, all zeroes:

> python send_cuda_test.py
===== TEST PROCESS PUT TENSOR with num ones: 30 tensor: tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], device='cuda:0')
-------->>> MAIN PROCESS GOT TENSOR num ones::: 0 item: tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0.]], device='cuda:0')
===== TEST PROCESS PUT TENSOR with num ones: 30 tensor: tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], device='cuda:0')
-------->>> MAIN PROCESS GOT TENSOR num ones::: 30 item: tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], device='cuda:0')
===== TEST PROCESS PUT TENSOR with num ones: 30 tensor: tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], device='cuda:0')
-------->>> MAIN PROCESS GOT TENSOR num ones::: 30 item: tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], device='cuda:0')
===== TEST PROCESS PUT TENSOR with num ones: 30 tensor: tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], device='cuda:0')
-------->>> MAIN PROCESS GOT TENSOR num ones::: 30 item: tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], device='cuda:0')

I used a SimpleQueue to reduce the number of variables here, since mp.Queue spins up its own feeder threads, which could complicate things. I also added a torch.cuda.synchronize() right after creating the ones tensor to rule out a synchronization issue.

I know that the producer process is supposed to outlive the consumer process. I also tested it with the main process as the producer and the child as the consumer, and it has the exact same issue: the first tensor read from the queue is all 0s.

This is with Python 3.9.13 and PyTorch 2.1.0+cu118 on Windows 10 with an RTX 3080 Ti.

Any insight would be much appreciated. I’m guessing there’s some nuance about sending CUDA tensors that I’m missing here.

At least according to the documentation, CUDA tensors cannot be shared across processes on Windows (link below). I’m not entirely sure why you’re not getting an error, but I do believe it’s an inherent limitation (of Windows/CUDA rather than PyTorch).

Best regards

Thomas

https://pytorch.org/docs/stable/notes/windows.html#cuda-ipc-operations
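Since CUDA IPC is not available on Windows, the usual workaround is to send a CPU copy of the tensor through the queue and copy it back to the GPU on the consumer side. A minimal sketch of that idea (the function names here are just for illustration, and it falls back to CPU when no GPU is present):

    import torch
    import torch.multiprocessing as mp

    def produce(queue: mp.SimpleQueue, t: torch.Tensor) -> None:
        # Send a CPU copy; CPU tensors are shared via shared memory, which works on Windows
        queue.put(t.detach().cpu())

    def consume(queue: mp.SimpleQueue) -> torch.Tensor:
        item = queue.get()
        # Copy back onto the GPU on the receiving side (no-op fallback on CPU-only machines)
        device = "cuda" if torch.cuda.is_available() else "cpu"
        return item.to(device)

This costs a device-to-host and a host-to-device copy per item; calling `pin_memory()` on the CPU copy and passing `non_blocking=True` to the transfer back can hide some of that latency.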
