I assume you have downloaded gtFine_trainvaltest.zip and leftImg8bit_trainvaltest.zip from https://www.cityscapes-dataset.com/downloads/ and extracted them to ~/BigDatas/cityscapes, which then contains the gtFine and leftImg8bit directories.

Steps to reproduce:

git clone https://github.com/fyu/drn.git
cd drn
python3 datasets/cityscapes/prepare_data.py ~/BigDatas/cityscapes/gtFine
cp datasets/cityscapes/create_lists.sh ~/BigDatas/cityscapes
cp datasets/cityscapes/info.json ~/BigDatas/cityscapes
cd ~/BigDatas/cityscapes
sh create_lists.sh
chmod u+x segment.py

After that, the following command works very well:

CUDA_VISIBLE_DEVICES=0,1 python3 segment.py train -d ~/BigDatas/cityscapes -c 19 -s 896 --arch drn_d_22 --batch-size 32 --epochs 250 --lr 0.01 --momentum 0.9 --step 100

However, the same command with CUDA_LAUNCH_BLOCKING=1 added freezes:

CUDA_VISIBLE_DEVICES=0,1 CUDA_LAUNCH_BLOCKING=1 python3 segment.py train -d ~/BigDatas/cityscapes -c 19 -s 896 --arch drn_d_22 --batch-size 32 --epochs 250 --lr 0.01 --momentum 0.9 --step 100

I suspect I am the first person in the world to hit the problem "training with CUDA_LAUNCH_BLOCKING=1 freezes", and I am at a loss.

Environment: Ubuntu 16.04, four GTX 1080 GPUs, and PyTorch 0.4 installed with pip install torch.

I am debugging this myself now, and I will try my best to narrow down where the problem comes from.

I have reduced my project to a single file, segment.py:

#!/usr/bin/env python

import torch
from torch import nn
import torch.utils.data

class DRN(nn.Module):
    def __init__(self):
        super(DRN, self).__init__()
        self.a = nn.Conv2d(3, 16, kernel_size=7)

    def forward(self, x):
        print('before DRN forward')
        return x

if __name__ == '__main__':
    model = torch.nn.DataParallel(DRN()).cuda().train()
    input_ = torch.rand(2).cuda()
    print('before input')

If I run CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0,1 ./segment.py, it hangs after printing "before input".

However, if I change rand(2) to rand(1), it no longer hangs.
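
When comparing variants like these, it may help to run each one under a timeout so that a frozen process is killed and reported instead of blocking the terminal. The following is only a minimal sketch using the Python standard library; the helper name run_with_timeout and the idea of classifying a run as "ok", "failed" or "hung" are my own assumptions, not something from the original report.

```python
import os
import subprocess
import sys


def run_with_timeout(cmd, extra_env, timeout_s):
    """Run cmd with extra environment variables added.

    Returns "ok" (exit code 0), "failed" (non-zero exit code) or
    "hung" (the process did not finish within timeout_s seconds).
    """
    env = dict(os.environ)
    env.update(extra_env)
    try:
        proc = subprocess.run(cmd, env=env, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising the timeout.
        return "hung"
    return "ok" if proc.returncode == 0 else "failed"
```

For example, calling run_with_timeout([sys.executable, "segment.py"], {"CUDA_VISIBLE_DEVICES": "0,1", "CUDA_LAUNCH_BLOCKING": "1"}, 60) once with "1" and once with "0" would show whether only the blocking variant hangs. The 60-second limit is an arbitrary choice.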


I reported the bug here: https://github.com/pytorch/pytorch/issues/9163

It turns out that CUDA_LAUNCH_BLOCKING=1 does not work with DataParallel.


What does CUDA_LAUNCH_BLOCKING=1 mean?

This environment variable makes the kernel launches synchronous, so that the stack trace points to the right line of code in case a kernel hits an internal assert. Otherwise, due to the asynchronous execution of CUDA kernels, the error might be reported at a different line of code, since the CPU can run ahead.
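
As a rough illustration, the variable can also be set from inside a script rather than on the command line. The sketch below sets it before importing torch, on the assumption that it has to be in the environment before CUDA is initialized. The tensor size is arbitrary, and the torch import is guarded so the snippet also runs on a machine without PyTorch or without a GPU.

```python
import os

# Assumption: the variable must be in the environment before CUDA is
# initialized, so set it at the very top, before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

try:
    import torch
except ImportError:
    torch = None  # PyTorch is not installed; skip the GPU part below

if torch is not None and torch.cuda.is_available():
    x = torch.rand(1024, 1024, device="cuda")
    # With blocking launches, each CUDA call returns only after its kernel
    # has finished, so an error surfaces at the line that caused it.
    y = x @ x
    print(y.sum().item())
```

This uses a single GPU and no DataParallel, so it does not exercise the multi-GPU case discussed in this thread.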


I encountered the same problem. In the end, I fixed it by replacing “CUDA_VISIBLE_DEVICES=2,3” with “CUDA_VISIBLE_DEVICES=3”, so that only one GPU is visible.
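
A workaround along these lines can be sketched as a small helper that builds a single-GPU debugging environment. This is only an illustration: the function name single_gpu_debug_env is hypothetical, and keeping the first listed device is an arbitrary choice (the reply above kept device 3).

```python
import os


def single_gpu_debug_env(base_env=None):
    """Return a copy of the environment with one visible GPU and blocking launches."""
    env = dict(os.environ if base_env is None else base_env)
    listed = env.get("CUDA_VISIBLE_DEVICES", "")
    devices = [d.strip() for d in listed.split(",") if d.strip()]
    # Keep the first listed device; fall back to device 0 if none is listed.
    env["CUDA_VISIBLE_DEVICES"] = devices[0] if devices else "0"
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env
```

The returned dictionary can be passed as the env argument of subprocess.run to launch the training script for a debugging session, while normal multi-GPU runs keep the original environment.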