The main process in the code snippet below freezes after several iterations.
I think it's related to how PyTorch's data structures are stacked: tensors are built on top of storage classes, and storage classes are built on top of raw cudaMalloc regions. I understand what
rebuild_cuda_tensor() does, but I am not sure why creating a new tensor (since g_tensor has been reassigned) after a blocking starmap would cause the main process to freeze.
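(As a side note on the tensor/storage layering mentioned above, several tensors can be views over one storage, and the underlying allocation only goes away when the last reference to that storage does. A minimal CPU-side illustration, not tied to this bug:)

```python
import torch as t

x = t.zeros(2, 3)
y = x.view(6)  # a different tensor object over the same storage

# both tensors point at the same underlying allocation
assert x.data_ptr() == y.data_ptr()

y[0] = 1.0
# a write through one view is visible through the other
assert x[0, 0].item() == 1.0
```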
```python
import itertools as it
import torch as t
import torch.multiprocessing as mp


def infer(id, tensor):
    print(id)
    print(tensor)
    # del tensor immediately doesn't solve the problem
    del tensor


# some global tensor
g_tensor = t.full([1000, 1000], 2, device="cuda:0")
g_tensor.share_memory_()

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    pool = ctx.Pool(2)
    for i in range(10000000):
        print("start")
        pool.starmap(infer, zip(range(5), it.repeat(g_tensor)))
        # cpu tensors work just fine
        # for cuda tensors:
        # if I delete the global tensor, reassign it with a new cuda tensor
        # or if I use a tensor created dynamically in each iteration
        # the program freezes after 2 iterations.
        # Comment out the following lines and everything will work fine.
        del g_tensor
        g_tensor = t.full([1000, 1000], 2, device="cuda:0")
        g_tensor.share_memory_()
```
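For what it's worth, the hang would be consistent with the workers still holding a CUDA IPC mapping of the old allocation at the moment it is freed and replaced. A sketch of a possible workaround (my assumption, not verified against this exact setup) is to keep the shared tensor's storage stable and update its contents in place with `fill_`/`copy_` instead of reallocating each iteration:

```python
import itertools as it
import torch as t
import torch.multiprocessing as mp


def infer(idx, tensor):
    print(idx, tensor.mean().item())


if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    pool = ctx.Pool(2)
    # allocate the shared CUDA tensor once, up front
    g_tensor = t.full([1000, 1000], 2.0, device="cuda:0")
    g_tensor.share_memory_()
    for i in range(100):
        pool.starmap(infer, zip(range(5), it.repeat(g_tensor)))
        # update in place: the storage (and hence the IPC handle the
        # workers map) never changes, so nothing is freed under them
        g_tensor.fill_(float(i))
    pool.close()
    pool.join()
```

The key property is that an in-place write never changes the tensor's underlying allocation, so the data pointer the consumers received stays valid.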
- PyTorch Version (e.g., 1.0): 1.1.0
- OS (e.g., Linux): Linux
- How you installed PyTorch (conda, pip, source): pip
- Python version: 3.5
- CUDA/cuDNN version: 9.1/7.2.1