Windows & WSL2: zeroed CUDA tensors in spawned processes

taras-janea · June 13, 2025, 9:34am

Dear PyTorch Team,

Please help me find the root cause of the issue described below.

Issue description

Executing the following script with PyTorch 2.7 and CUDA 12.6:

import torch
import torch.multiprocessing as mp

def producer(shared_queue, producer_done, consumer_done):
    for i in range(1, 10):
        t = torch.tensor([i, i*2], device="cuda:0")
        print(f"#{i}: Tensor to send: {t}")
        shared_queue.put(t)
    producer_done.set()
    consumer_done.wait()

def consumer(shared_queue, producer_done, consumer_done):
    producer_done.wait()
    for i in range(1, 10):
        t = shared_queue.get()
        print(f"#{i}: Tensor received: {t}")

    consumer_done.set()

if __name__ == "__main__":

    # force=True overrides any start method that was already set
    mp.set_start_method("spawn", force=True)

    shared_queue = mp.Queue()
    consumer_done = mp.Event()
    producer_done = mp.Event()

    producer_process = mp.Process(target=producer, args=(shared_queue, producer_done, consumer_done))
    consumer_process = mp.Process(target=consumer, args=(shared_queue, producer_done, consumer_done))
    
    producer_process.start()
    consumer_process.start()   
    producer_process.join()
    consumer_process.join()

consistently produces the same output on both Windows and WSL2 (Ubuntu 24.04):

#1: Tensor to send: tensor([1, 2], device='cuda:0')
#2: Tensor to send: tensor([0, 0], device='cuda:0')
#3: Tensor to send: tensor([3, 6], device='cuda:0')
#4: Tensor to send: tensor([4, 8], device='cuda:0')
#5: Tensor to send: tensor([ 5, 10], device='cuda:0')
#6: Tensor to send: tensor([ 6, 12], device='cuda:0')
#7: Tensor to send: tensor([ 7, 14], device='cuda:0')
#8: Tensor to send: tensor([ 8, 16], device='cuda:0')
#9: Tensor to send: tensor([ 9, 18], device='cuda:0')
#1: Tensor received: tensor([0, 0], device='cuda:0')
#2: Tensor received: tensor([0, 0], device='cuda:0')
#3: Tensor received: tensor([3, 6], device='cuda:0')
#4: Tensor received: tensor([4, 8], device='cuda:0')
#5: Tensor received: tensor([ 5, 10], device='cuda:0')
#6: Tensor received: tensor([ 6, 12], device='cuda:0')
#7: Tensor received: tensor([ 7, 14], device='cuda:0')
#8: Tensor received: tensor([ 8, 16], device='cuda:0')
#9: Tensor received: tensor([ 9, 18], device='cuda:0')

The first two received tensors are zeroed. The problem is not limited to the receiving side: the second tensor is already zeroed in the producer, before it is sent. The issue does not occur on a native Ubuntu setup.

Some findings

The documentation (Windows FAQ, PyTorch 2.7 documentation) says that sharing CUDA tensors is not supported;
There is a comment from peterjc123 that sharing CUDA tensors is not supported on Windows, but that was back in 2018.
Unit tests for multiprocessing are disabled on Windows: pytorch/test/test_multiprocessing.py at 56b03df6ac5b4185a2b7b92f253565500a5b51ca · pytorch/pytorch · GitHub
Open topics that refer to this issue:
Torch.multiprocessing on CUDA turns tensors to zeros
PyTorch multiprocessing with CUDA sets tensors to 0 - #11 by Skirlax
Issue with CUDA tensors shared between processes
Use torch.multiprocessing.queue with cuda tensor
PyTorch multiprocessing not sending CUDA tensors properly on Windows
Open similar topics:
Best practice to share CUDA tensors across multiprocess
CUDA tensors on multiprocessing queue
DataLoader multiprocessing with Dataset returning a CUDA tensor
Sharing CUDA tensor
DataLoader: is returning CUDA tensors always bad in distributed training?
Synchronization of CUDA operations between `multiprocess` processes
Allocate cuda tensor in subprocess - #5 by florin
Multiprocessing CUDA memory - #12 by PatrickNercessian
Using CUDA IPC memory handles in pytorch - #2 by colesbury
A call to torch.cuda.is_available makes an unrelated multi-processing computation crash?
Relevant GitHub issues:
torch.multiprocessing subprocess receives tensor with zeros rather than actual data · Issue #1015 · pytorch/examples · GitHub
Cuda tensor is zero when passed through multiprocessing queue · Issue #84994 · pytorch/pytorch · GitHub
Problems with initial communication between GPUs · Issue #56771 · pytorch/pytorch · GitHub
Parameters of cuda module zero out when used in multiprocessing · Issue #109094 · pytorch/pytorch · GitHub
Data corruption when reading data as CUDA tensor from a different process · Issue #134273 · pytorch/pytorch · GitHub
Unexpected behaviour with shared modules in multiprocessing on WSL2 · Issue #112340 · pytorch/pytorch · GitHub
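
Possible workaround (untested sketch)

Since the documentation says that sharing CUDA tensors is not supported on Windows, one option that might sidestep the problem is to copy each tensor to host memory before putting it on the queue and to move it back to the GPU in the consumer. The sketch below is a variation of the script above. The helper names pick_device and to_host are hypothetical, the CPU fallback exists only so the code runs without a GPU, and the 60-second timeouts are arbitrary. I am assuming, not claiming, that this avoids the zeroed tensors on Windows and WSL2.

```python
import torch
import torch.multiprocessing as mp

N = 9  # number of tensors to send, matching the script above


def pick_device():
    # Hypothetical helper: fall back to the CPU so the sketch also runs
    # on machines without a GPU.
    return torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


def to_host(t):
    # Hypothetical helper: .cpu() copies the data into host memory, so the
    # queue transfers plain host data rather than a CUDA IPC handle.
    return t.detach().cpu()


def producer(queue, consumer_done):
    device = pick_device()
    for i in range(1, N + 1):
        t = torch.tensor([i, i * 2], device=device)
        queue.put(to_host(t))
    # Stay alive until the consumer has read everything, as in the original.
    consumer_done.wait(timeout=60)


def consumer(queue, consumer_done):
    device = pick_device()
    for i in range(1, N + 1):
        # Move the received host tensor back to the local device.
        t = queue.get(timeout=60).to(device)
        print(f"#{i}: Tensor received: {t}")
    consumer_done.set()


if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    queue = mp.Queue()
    consumer_done = mp.Event()
    workers = [
        mp.Process(target=producer, args=(queue, consumer_done)),
        mp.Process(target=consumer, args=(queue, consumer_done)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

The cost is one extra device-to-host and one host-to-device copy per tensor, so this only makes sense when the tensors are small or the transfer rate is low.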

Thank you in advance for your assistance.