Program hangs at os.waitpid

I just ran into a very weird bug: when I use multiprocessing to create a network in a child process, the program hangs indefinitely. The code is below:

import torch
from torch import nn
import torch.multiprocessing as mp

def normalized_columns_initializer(weights, std=1.0):
    out = torch.randn(weights.size())
    out *= std / torch.sqrt(out.pow(2).sum(1, keepdim=True))
    return out

class ACNet(nn.Module):
    def __init__(self, max_addrs, num_loc, hidden_dim=64):
        super().__init__()  # required before assigning submodules
        self.loc_linear = nn.Linear(hidden_dim, num_loc)

        print('start init')
        self.loc_linear.weight.data = normalized_columns_initializer(
            self.loc_linear.weight.data, 0.01)    #todo: this line causes deadlock somehow

        print("init done")

def create_model(dim):
    mdl = ACNet(100, dim)
    print('model is created')

if __name__ == '__main__':
    shared_model = ACNet(100, 512)
    p = mp.Process(target=create_model, args=(512,))
    p.start()
    p.join()  # the parent blocks here, inside os.waitpid()

Interrupting the program shows that it is stuck in os.waitpid(), i.e. the parent is waiting for a child process that never exits. If either model’s second parameter is changed to a number smaller than 512, the program runs to completion. The same is true if both models are created by directly invoking the constructor, or if both are created in child processes. Does anyone have any idea about this? My environment is as below:
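
In case it helps anyone reproduce this, here is a minimal sketch of a helper that joins the child with a timeout, so the parent reports a stuck child instead of blocking forever in os.waitpid(). It uses only the standard library. The names run_with_timeout, finishes_quickly and sleeps_too_long are hypothetical and not part of my original script; the two small functions are stand-ins for create_model.

```python
import multiprocessing as mp
import time


def run_with_timeout(target, args=(), timeout=5.0, start_method="fork"):
    """Run target in a child process and report whether it outlived the timeout.

    Returns (hung, exitcode). A child that is still running after `timeout`
    seconds is terminated so the parent never blocks forever in join().
    """
    ctx = mp.get_context(start_method)
    p = ctx.Process(target=target, args=args)
    p.start()
    p.join(timeout)       # returns after `timeout` seconds even if the child is stuck
    hung = p.is_alive()
    if hung:
        p.terminate()     # sends SIGTERM on POSIX systems
        p.join()          # reap the terminated child
    return hung, p.exitcode


# Hypothetical stand-ins for create_model, used only to exercise the helper.
def finishes_quickly():
    pass


def sleeps_too_long():
    time.sleep(60)


if __name__ == "__main__":
    print(run_with_timeout(finishes_quickly, timeout=5.0))
    print(run_with_timeout(sleeps_too_long, timeout=0.5))
```

With the script above, the call would be run_with_timeout(create_model, args=(512,)). Passing start_method="spawn" instead of "fork" is one experiment that might narrow things down; whether it changes the outcome here is an assumption I have not verified.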

PyTorch version: 1.7.0+cpu
Is debug build: True
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Linux Mint 19.1 Tessa (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 7.0.0 (tags/RELEASE_700/final)
CMake version: version 3.12.3

Python version: 3.6 (64-bit runtime)
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.4
[pip3] torch==1.7.0+cpu
[pip3] torchaudio==0.7.0
[pip3] torchvision==0.8.1+cpu
[conda] Could not collect