Model crashes after putting onto GPU when using Distributed module

I’m trying to use the PyTorch distributed package with the MPI backend, following the tutorial here. When I run the code below, putting the model onto the GPU takes much longer than usual, and then the process crashes with the following output:

Before loop
A process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

Local host: [[58560,1],0] (PID 3778)

If you are absolutely sure that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
*** glibc detected *** /home/xqding/apps/anaconda3/bin/python: munmap_chunk(): invalid pointer: 0x00007fc92f8d8950 ***
======= Backtrace: =========

Here is my code:

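import pickle
import subprocess
import time

import numpy as np
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
from torch.utils.data import DataLoader
from torchvision import models, transforms
# CD_Dataset and RandomHorizontalFlip come from my own code (not shown)

world = dist.init_process_group(
    backend='mpi',
    init_method='./output/shared_file')
rank = dist.get_rank()
world_size = dist.get_world_size()
print('Hello from process {} (out of {})!'.format(
    dist.get_rank(), dist.get_world_size()))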

with open("…/data/category_ids.pkl", 'rb') as file_handle:
    category_ids = pickle.load(file_handle)
num_classes = len(category_ids)

with open("…/split_train/num_images_train.pkl", 'rb') as file_handle:
    num_images_train = pickle.load(file_handle)

idx_all = range(num_images_train)
# each rank takes one contiguous chunk; the last rank takes whatever is left
num_images_per_reps = int(num_images_train / world_size) + 1
if rank < world_size - 1:
    idx_list = range(rank*num_images_per_reps, (rank+1)*num_images_per_reps)
else:
    idx_list = range(rank*num_images_per_reps, num_images_train)

train_data = CD_Dataset(
    root_dir='…/split_train',
    idx_list=idx_list,
    transform=transforms.Compose([RandomHorizontalFlip(),

print("Rank: {}, num_image: {}".format(rank, len(train_data)))
train_loader = DataLoader(train_data,
                          batch_size=50,
                          num_workers=5)

print("Make a model")
net = models.ResNet(models.resnet.BasicBlock, [3, 4, 6, 3],
                    num_classes=num_classes)
print("Put model onto GPU")
#net = torch.nn.DataParallel(net, device_ids=[0]).cuda()
criterion = nn.CrossEntropyLoss().cuda(rank)
optimizer = optim.SGD(net.parameters(),
                      lr=0.01, momentum=0.9,
                      weight_decay=1e-8)

records = {'lr': [], 'loss': []}
start_time = time.time()
previous_time = time.time()
num_epoches = 10
# the function name on the next line was lost in the post; subprocess.run is assumed
subprocess.run(['mkdir', '-p', './log/train/'])
log_file = open('./log/train/log_rank_{}.txt'.format(rank), 'w')
print("Before loop")
for epoch in range(num_epoches):  # loop over the dataset multiple times
    running_loss = 0.0
    for i, data in enumerate(train_loader, 0):
        # get the inputs
        inputs = data['image']
        labels = data['category_id']
        labels = np.array([category_ids.index(l) for l in labels])
        # wrap them in Variable
        inputs, labels = Variable(inputs.cuda(rank)),
        # zero the parameter gradients
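
As a side note on the per-rank index split in the code above: with a chunk size of int(n / world_size) + 1, the last rank can end up with few or no images (for example, 6 images over 4 ranks leaves the last rank with an empty range). A minimal sketch of a more even split is shown below. It is a different scheme from the one in the post, and the helper name shard_indices is hypothetical, used here only for illustration.

```python
def shard_indices(num_items, world_size, rank):
    """Return the contiguous range of item indices owned by `rank`.

    The first (num_items % world_size) ranks each get one extra item,
    so shard sizes differ by at most one.
    """
    base, extra = divmod(num_items, world_size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)


if __name__ == "__main__":
    # Illustrative numbers only: 10 items split over 3 ranks.
    for r in range(3):
        print(r, list(shard_indices(10, 3, r)))
```

In the script above, this would stand in for the if/else that builds idx_list, called with num_images_train, world_size and rank.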

Make sure you have a CUDA-aware MPI implementation.
I don’t know if this is the cause of your problem, but I had the same issue (moving things to the GPU made it crash), though with a different error message.
Check this: Segfault using cuda with openmpi
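
One way to check for CUDA support, assuming the MPI in use is Open MPI (the fork() warning above suggests it is): Open MPI reports a build flag named mpi_built_with_cuda_support through ompi_info. The sketch below shells out to ompi_info and looks for that flag. The helper names parse_cuda_support and openmpi_is_cuda_aware are made up for this example, and a result of None only means the flag could not be found, not that CUDA support is absent.

```python
import shutil
import subprocess


def parse_cuda_support(ompi_info_output):
    """Return True/False if the build flag is reported, None if it is absent."""
    for line in ompi_info_output.splitlines():
        if "mpi_built_with_cuda_support:value:" in line:
            return line.strip().endswith(":true")
    return None


def openmpi_is_cuda_aware():
    """Query the local Open MPI install; None means 'could not tell'."""
    if shutil.which("ompi_info") is None:
        return None  # ompi_info is not on PATH
    result = subprocess.run(
        ["ompi_info", "--parsable", "--all"],
        capture_output=True, text=True, check=False)
    return parse_cuda_support(result.stdout)


if __name__ == "__main__":
    print("CUDA-aware Open MPI:", openmpi_is_cuda_aware())
```

If this prints False, the MPI build itself is a likely suspect, and rebuilding Open MPI with CUDA support would be worth trying before debugging the training script further.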