Multiprocessing process started on gpu-1 copied to gpu-0 when printing tensor

I noticed strange behavior when using multiprocessing. My main process sends data to a queue. Two spawned processes read from the queue and create CUDA tensors. If I print the tensors, the process running on gpu-1 also shows up on gpu-0 in nvidia-smi. Hence it looks as if I had 3 processes: 2 on gpu-0 and 1 on gpu-1. Two of these 3 processes have the same PID, namely the PID of the process running on gpu-1. If instead I print only the tensor size, or anything else, or print nothing at all, I end up with just 2 processes, each running on one GPU. The behavior is reproducible with multiprocessing as well as torch.multiprocessing. See the code below for reproduction.
Does anyone have an idea why this is?

nvidia-smi output when printing tensor:

nvidia-smi output otherwise:


import torch
import torch.multiprocessing as mp

def run(q, dev):
    t = torch.tensor([1], device=dev)
    for data in iter(q.get, None):
        new_t = torch.tensor([data], device=dev)
        t =, new_t), dim=0)

        # Causes the copy:
        print(t)

        # Any of the following doesn't cause the copy:
        # print(t.size())
        # print(data)


if __name__ == '__main__':
    ctx = mp.get_context('spawn')
    q = ctx.Queue()
    devices = [
        torch.device('cuda', i)
        for i in range(torch.cuda.device_count())
    processes = [
        ctx.Process(target=run, args=(q, dev))
        for dev in devices
    for pr in processes:

    for d in range(1, 1000000):

    for _ in processes:
        q.put(None)  # one sentinel per worker ends the iter(q.get, None) loop
    for pr in processes:
This was fixed on master.
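For anyone stuck on a release without the fix, one workaround I would expect to help (an assumption on my part, not something confirmed in this thread) is to mask `CUDA_VISIBLE_DEVICES` in each child's environment before CUDA initializes, so a worker physically cannot create a context on any other GPU. A CPU-testable sketch using `subprocess`; the child here only reports its mask, where a real worker would import torch and build its tensors on `cuda:0`:

```python
import os
import subprocess
import sys

# Hypothetical helper: launch a worker that can only see one physical GPU.
# Inside the child, that GPU appears as 'cuda:0', and no context can be
# created on any other device (e.g. by printing a tensor).
def launch_masked(gpu_id, code):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    proc = subprocess.run([sys.executable, '-c', code],
                          env=env, capture_output=True, text=True)
    return proc.stdout.strip()

visible = launch_masked(1, "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])")
print(visible)  # 1
```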