A problem about using multi-GPU with a two-stage CNN model

I designed a CNN model that has two stages. The first stage generates proposals, which are then fed into the second stage, but this causes an error in the second stage.

According to the error information below, it seems like the second input is not correctly assigned across the multiple GPUs.

  File "/home/f523/guazai/sdb/rsy/cornerPoject/myCornerNet6/exp/train.py", line 212, in run_epoch
    cls, rgr = self.model([proposal, fm], stage='two')
  File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 156, in forward
    return self.gather(outputs, self.output_device)
  File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/cuda/comm.py", line 166, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: CUDA error: an illegal memory access was encountered

In case you are using an older PyTorch release, could you please update to the latest stable or nightly release and rerun your script? Also, run the script with CUDA_LAUNCH_BLOCKING=1 python script.py args, which should provide a stack trace pointing to the failing operation.

Thank you for your response @ptrblck
I have tried CUDA_LAUNCH_BLOCKING=1 python script.py, but it is so slow that I got no error output even though I waited for a long time. Is there any way to speed it up?

If blocking launches are too slow, you could also directly run the script with cuda-gdb and check the backtrace. Did you check your PyTorch version and are you already using the latest one?

@ptrblck I have updated torch from 1.6 to 1.8.1.

But the error still exists and I'm not familiar with cuda-gdb. Any hint or pseudo-script to help me out?

My model script is shown below. I want my two-stage model to support batched input, e.g. with a batch size of 4 and 128 proposals per image, so the proposal tensor here has shape (4*128, 5).

    def _stage2(self, xs):
        # roi_align is torchvision.ops.roi_align
        proposal, fm = xs
        if proposal.dim() == 2 and proposal.size(1) == 5:
            # train mode: boxes given as rows of (batch_index, x1, y1, x2, y2)
            roi = roi_align(fm, proposal, output_size=[15, 15])
        elif proposal.dim() == 3 and proposal.size(2) == 4:
            # eval mode: per-image boxes given as rows of (x1, y1, x2, y2)
            roi = roi_align(fm, [proposal[0]], output_size=[15, 15])
        else:
            raise ValueError("The boxes tensor shape should be Tensor[K, 5] in train or Tensor[N, 4] in eval")
        x = self.big_kernel(roi)
        cls = self.cls_fm(x)
        rgr = self.rgr_fm(x)
        return cls, rgr
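
For context on those shapes, here is a minimal illustrative sketch (not the author's code; the names and random boxes are assumptions) of how a (4*128, 5) proposal tensor with the image index in column 0 could be assembled for the train-mode branch above:

    # Minimal sketch (assumed shapes and names): building the Tensor[K, 5] box format
    # that torchvision.ops.roi_align expects in the train branch, i.e. one row per
    # proposal laid out as (batch_index, x1, y1, x2, y2).
    import torch

    batch_size, num_props = 4, 128
    x1y1 = torch.rand(batch_size, num_props, 2) * 50
    wh = torch.rand(batch_size, num_props, 2) * 50
    boxes = torch.cat([x1y1, x1y1 + wh], dim=2)          # valid (x1, y1, x2, y2) boxes

    # prepend each proposal's image index as column 0, then flatten over the batch
    batch_idx = torch.arange(batch_size, dtype=torch.float32).view(-1, 1, 1).expand(-1, num_props, 1)
    proposal = torch.cat([batch_idx, boxes], dim=2).reshape(-1, 5)
    print(proposal.shape)                                # torch.Size([512, 5])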

Assuming you have a locally installed CUDA toolkit, you could debug the code via:

cuda-gdb --args python script.py args
...
run
...
bt

where bt would show the backtrace and should point to the failure.
While cuda-gdb won't block the kernel launches, note that it also slows down the execution.

@ptrblck thank you for your help
I found out where I went wrong. Here is how I feed the input into the second stage:

cls, offset = self.model([proposal, fm], stage='two')

proposal contains the ROIs with shape [N, 5], where the first column is the batch index, e.g. with a batch size of 4 the index range is [0, 1, 2, 3]. And fm is the feature map.

When I use multiple GPUs, e.g. 2 GPUs, proposal and fm are split into two chunks and fed to the two GPUs. However, the batch index range is still [0, 1, 2, 3] on each device, which causes an index error and raises the GPU error.
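
To make that concrete, here is a small illustrative sketch (assumed shapes, not the actual training code) of what each replica sees: nn.DataParallel scatters every tensor input along dim 0, so each GPU only receives part of fm, while the batch indices stored in proposal[:, 0] still cover the whole batch.

    # Illustrative sketch of the mismatch (assumed shapes):
    import torch

    batch_size, num_props = 4, 128
    fm = torch.rand(batch_size, 256, 32, 32)                           # feature maps of 4 images
    batch_idx = torch.arange(batch_size).repeat_interleave(num_props)  # 0,...,0,1,...,1,...,3

    # roughly what two replicas receive after the scatter along dim 0
    fm_chunks = fm.chunk(2, dim=0)          # each replica's fm has fm.size(0) == 2
    idx_chunks = batch_idx.chunk(2, dim=0)  # but the second replica's indices are {2, 3}
    print(fm_chunks[1].size(0))             # 2
    print(idx_chunks[1].unique())           # tensor([2, 3]) -> out of range for roi_align on that GPU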

What I did was add a line before roi_align, as shown below:

from torchvision.ops import roi_align
proposal[:, 0] = proposal[:, 0] % fm.size(0)  # remap the global batch index to this replica's local range; this makes multi-GPU work
roi = roi_align(fm, proposal, output_size=[15, 15])
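
Note that this modulo remap relies on DataParallel splitting proposal and fm into aligned, equally sized chunks, i.e. the batch size being divisible by the number of GPUs and the proposals staying grouped per image in the same order as fm. If that ever changes (for example a batch size that is not divisible by the GPU count), the modulo could silently map proposals onto the wrong images rather than raise an error.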

Good to hear you've isolated the issue!
Which torchvision version are you using? It should trigger a RuntimeError instead of an illegal memory access.

I’m using 0.9.1+cu101

Hope this helps make torch better!