DDP evaluation: illegal memory access error during the forward pass

Hi,

I have a machine with 2 GPUs, and I was trying to use mp.spawn to create 2 subprocesses to run DDP evaluation. In the master process I created a CPU model, and in each subprocess I moved the cloned CPU model to the corresponding GPU based on its rank. However, I ran into an illegal memory access error at the 2nd iteration in the second subprocess.

Environment

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00    Driver Version: 455.32.00    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    On   | 00000000:17:00.0 Off |                  N/A |
| 28%   29C    P8     5W / 180W |      2MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    On   | 00000000:B3:00.0 Off |                  N/A |
| 27%   33C    P8     6W / 180W |     19MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
Name: torch
Version: 1.9.1+cu111

Code Snippet

import os

import torch
import torch.utils.data
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms


def evaluate_ddp(rank, world_size, model):
    print('world size', world_size)
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12328"

    # create default process group
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    batch_size = 64
    num_workers = 4

    imagenet_dir = '/usr/local/workspace/datasets/imagenet/ILSVRC2012_PyTorch/'
    val_dir = os.path.join(imagenet_dir, 'val')

    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])

    val_set = datasets.ImageFolder(
        val_dir,
        transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            normalize,
        ]))

    #val_sampler = DistributedSampler(dataset=val_set)
    val_loader = torch.utils.data.DataLoader(val_set,
                                             batch_size=batch_size,
                                             shuffle=False,
                                             num_workers=num_workers,
                                             #sampler=val_sampler,
                                             pin_memory=True)

#    model = torchvision.models.resnet18(pretrained=True)
    #model.metric = metric
    #print(f'rank {rank}', next(model.parameters()).device)
    model = model.to(rank)
    #model = DDP(model, device_ids=[rank])

    #model.eval()
    group = dist.new_group([0, 1])
    with torch.no_grad():
        for i, (images, target) in enumerate(val_loader):
            # compute output
            images = images.to(rank, non_blocking=False)
            target = target.to(rank, non_blocking=False)
            print(f"image device: {images.device}, target device: {target.device}, model device: {next(model.parameters()).device}")

            _ = model(images)
            print_freq = 10
            if i % print_freq == 0:  
                print(f"batch {i}, rank {rank}")

    print(f'Done rank: {rank}')
    dist.destroy_process_group()

world_size = 2  # one spawned process per GPU
model = MyModel()  # the model is instantiated on the CPU here
mp.spawn(evaluate_ddp, args=(world_size, model), nprocs=world_size, join=True)

Terminal output

image device: cuda:0, target device: cuda:0, model device: cuda:0
image device: cuda:1, target device: cuda:1, model device: cuda:1
THCudaCheck FAIL file=../aten/src/THC/THCCachingHostAllocator.cpp line=278 error=700 : an illegal memory access was encountered
batch 0, rank 0
image device: cuda:0, target device: cuda:0, model device: cuda:0
image device: cuda:0, target device: cuda:0, model device: cuda:0
image device: cuda:0, target device: cuda:0, model device: cuda:0
image device: cuda:0, target device: cuda:0, model device: cuda:0
image device: cuda:0, target device: cuda:0, model device: cuda:0
image device: cuda:0, target device: cuda:0, model device: cuda:0
image device: cuda:0, target device: cuda:0, model device: cuda:0
Traceback (most recent call last):
  File "minimal_example.py", line 81, in <module>
    mp.spawn(evaluate_ddp, args=(world_size, model), nprocs=world_size, join=True)
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/usr/local/workspace/aimet_ddp/minimal_example.py", line 68, in evaluate_ddp
    _ = model(images)
  File "/usr/local/lib/python3.8/dist-packages/torch/fx/graph_module.py", line 513, in wrapped_call
    raise e.with_traceback(None)
RuntimeError: CUDA error: an illegal memory access was encountered

What’s weird is that this seems to be model dependent: I don’t run into any issues with torchvision.models.resnet18, and the problem only happens when I switch to my own model, which has a custom op. I would appreciate it if you could shed some light on this (e.g. how to debug further, or how to implement a custom model that supports DDP)!

Thank you in advance!

Another piece of information: if I instantiate my model on cuda:0 and then spawn the subprocesses, the memory error goes away. However, in that case subprocess 1 (cuda:1) runs significantly slower than subprocess 0 (cuda:0). Again, this only happens with my custom model, not with the torchvision models.
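
For reference, this is roughly what the workaround variant looks like (a sketch of what I described above; MyModel stands in for my custom model):

# Workaround variant: instantiate the model on cuda:0 in the parent process
# before spawning. The illegal memory access goes away, but subprocess 1
# (cuda:1) then runs much slower than subprocess 0 (cuda:0).
world_size = 2
model = MyModel().to('cuda:0')  # instead of leaving the model on the CPU
mp.spawn(evaluate_ddp, args=(world_size, model), nprocs=world_size, join=True)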

Could you check the stacktrace via cuda-gdb or compute-sanitizer, please?
If you get stuck, could you post a minimal, executable code snippet which reproduces the error?
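
For example, something along these lines can help narrow down which kernel is faulting (a rough sketch; the script name is just the one from your traceback):

# Force synchronous kernel launches so the Python traceback points at the op
# that actually triggered the illegal access, and enable NCCL logging.
# Set these in the parent process before mp.spawn so the children inherit them.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
os.environ["NCCL_DEBUG"] = "INFO"

# Alternatively, run the whole script under compute-sanitizer, e.g.:
#   compute-sanitizer --tool memcheck python minimal_example.py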