Freezing problem while using cuda tensor in multiprocessing environment

The main process in the code snippet below freezes after several iterations.

I think it is related to how the data-structure stack in PyTorch works: tensors are built on top of storage classes, and storage classes are built on top of raw cudaMalloc regions. I understand what reduce_tensor() and rebuild_cuda_tensor() are doing, but I am not sure why creating a new tensor (since g_tensor has been reassigned) after a blocking starmap would cause the main process to freeze.
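
For context, the layering described above can be inspected directly. Below is a minimal, illustrative sketch; the exact contents of the reduction tuple are internal to torch.multiprocessing and vary between versions, so treat the unpacking as an assumption rather than a stable API:

import torch as t
from torch.multiprocessing.reductions import reduce_tensor

x = t.full([4, 4], 2, device="cuda:0")

# A tensor is sizes/strides/offset on top of a Storage, and the Storage owns
# the raw device allocation handed out by the CUDA caching allocator.
print(x.storage().data_ptr())   # raw device pointer backing the tensor

# reduce_tensor() is what the spawn pickler uses for CUDA tensors: it returns
# a rebuild function plus IPC-handle metadata instead of copying the data.
rebuild_fn, metadata = reduce_tensor(x)
print(rebuild_fn.__name__)      # typically 'rebuild_cuda_tensor' for CUDA tensors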

Code example

import itertools as it
import torch as t
import torch.multiprocessing as mp


def infer(idx, tensor):
    # Worker: receives the shared CUDA tensor rebuilt from its IPC handle.
    print(idx)
    print(tensor)
    # Deleting the tensor immediately in the worker does not prevent the hang.
    del tensor

# some global tensor
g_tensor = t.full([1000, 1000], 2, device="cuda:0")
g_tensor.share_memory_()

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    pool = ctx.Pool(2)
    for i in range(10000000):
        print("start")
        pool.starmap(infer, zip(range(5), it.repeat(g_tensor)))

        # CPU tensors work just fine. For CUDA tensors, if I delete the
        # global tensor and reassign it with a new CUDA tensor, or if I use
        # a tensor created dynamically in each iteration, the program
        # freezes after 2 iterations.
        # Comment out the following lines and everything works fine.
        del g_tensor
        g_tensor = t.full([1000, 1000], 2, device="cuda:0")
        g_tensor.share_memory_()
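
As an untested workaround sketch (assuming the hang is tied to freeing and re-creating the shared CUDA allocation on every iteration): allocate the shared tensor once and overwrite its contents in place, so the storage behind the IPC handle is never released while workers might still reference it. This would replace the loop in the snippet above:

    for i in range(10000000):
        pool.starmap(infer, zip(range(5), it.repeat(g_tensor)))
        # Overwrite in place instead of del + reallocate; the underlying
        # storage (and its IPC handle) stays alive for the whole run.
        g_tensor.copy_(t.full([1000, 1000], 3, device="cuda:0"))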

Environment

  • PyTorch Version (e.g., 1.0): 1.1.0
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): pip
  • Python version: 3.5
  • CUDA/cuDNN version: 9.1/7.2.1

@mrshenli
I am sorting out my framework today, and as of PyTorch 1.5.0 this problem still persists. Is this expected behavior?

I tried this locally; it hangs after a non-deterministic number of iterations (4, 11, etc.), and it hangs at del g_tensor. I suspect the CUDACachingAllocator, which might need to keep the memory block alive until all other processes have finished using it. But the allocator is a per-process data structure, so it does not have a global view. I am not sure whether this is the cause of the hang.
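
For anyone reproducing this, one quick way to confirm where the main process is stuck is to arm faulthandler before entering the loop (py-spy dump on the hung PID would additionally show the native frames):

import faulthandler
# Periodically dump all Python thread stacks to stderr; once the process
# stalls, the frame sitting in `del g_tensor` shows up in the output.
faulthandler.dump_traceback_later(30, repeat=True)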

Call for help cc @ptrblck @colesbury


Thanks for following up on this. I filed an issue to track the problem on GitHub:

It looks like the deadlock is in the CudaIPCSentData destructor, which appears to be separate from the caching-allocator code.
