I designed a CNN model that has two stages. The first stage generates proposals, which are then fed into the second stage, but the second stage raises an error.
According to the error information below, it seems like the second input is not correctly assigned across the multiple GPUs.
File "/home/f523/guazai/sdb/rsy/cornerPoject/myCornerNet6/exp/train.py", line 212, in run_epoch
cls, rgr = self.model([proposal, fm], stage='two')
File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 156, in forward
return self.gather(outputs, self.output_device)
File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
return gather(outputs, output_device, dim=self.dim)
File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
res = gather_map(outputs)
File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
return Gather.apply(target_device, dim, *outputs)
File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
return comm.gather(inputs, ctx.dim, ctx.target_device)
File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/cuda/comm.py", line 166, in gather
return torch._C._gather(tensors, dim, destination)
RuntimeError: CUDA error: an illegal memory access was encountered
In case you are using an older PyTorch release, could you please update to the latest stable or nightly release and rerun your script? Also, run the script with CUDA_LAUNCH_BLOCKING=1 python script.py args, which should provide a stack trace pointing to the failing operation.
Thank you for your response @ptrblck
I have tried CUDA_LAUNCH_BLOCKING=1 python script.py, but it is so slow and I got no error even though I waited for a long time. Is there any way to speed it up?
If blocking launches are too slow, you could also directly run the script with cuda-gdb and check the backtrace. Did you check your PyTorch version and are you already using the latest one?
My model script is shown below. I want my two-stage model to support multiple images per batch; e.g., with a batch size of 4 and 128 proposals per image, the proposal tensor here has shape (4*128, 5).
from torchvision.ops import roi_align

def _stage2(self, xs):
    proposal, fm = xs
    if proposal.dim() == 2 and proposal.size(1) == 5:
        # train mode: boxes given as Tensor[K, 5] with a batch-index column
        roi = roi_align(fm, proposal, output_size=[15, 15])
    elif proposal.dim() == 3 and proposal.size(2) == 4:
        # eval mode: boxes given as a list of per-image Tensor[N, 4]
        roi = roi_align(fm, [proposal[0]], output_size=[15, 15])
    else:
        raise AssertionError("The boxes tensor shape should be Tensor[K, 5] in train or Tensor[N, 4] in eval")
    x = self.big_kernel(roi)
    cls = self.cls_fm(x)
    rgr = self.rgr_fm(x)
    return cls, rgr
Assuming you have a locally installed CUDA toolkit you could debug the code via:
cuda-gdb --args python script.py args
...
run
...
bt
where bt would show the backtrace and should point to the failure.
While cuda-gdb won’t block the kernel launches, note that it’s also slowing down the execution.
proposal is the ROI tensor of shape [N, 5], whose first column is the batch index; e.g., with a batch size of 4, the index range is [0, 1, 2, 3]. fm is the feature map.
When I use multiple GPUs, e.g. 2, the proposal and fm are split into two chunks and fed to the two GPUs. However, the batch-index column still spans [0, 1, 2, 3], while each replica's fm only holds 2 images, so the out-of-range index raises the GPU error.
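The mismatch can be illustrated without torch. This is a sketch with hypothetical sizes (4 images, 2 proposals each, 2 GPUs), mimicking only the fact that nn.DataParallel scatters both tensors along dim 0:

```python
# Sketch of how scattering breaks the batch-index column of the proposals.
batch_size, proposals_per_img, num_gpus = 4, 2, 2

# Flat proposal rows: (batch_index, x1, y1, x2, y2), as in Tensor[K, 5]
proposals = [(img, 0.0, 0.0, 10.0, 10.0)
             for img in range(batch_size)
             for _ in range(proposals_per_img)]

# Scatter splits the rows evenly along dim 0, one chunk per GPU
chunk_len = len(proposals) // num_gpus
chunks = [proposals[i * chunk_len:(i + 1) * chunk_len] for i in range(num_gpus)]

# Each replica's fm holds only batch_size // num_gpus images
per_gpu_imgs = batch_size // num_gpus
for gpu, chunk in enumerate(chunks):
    bad = [row[0] for row in chunk if row[0] >= per_gpu_imgs]
    print(f"GPU {gpu}: out-of-range batch indices {bad}")
# The second replica's chunk still carries batch indices 2 and 3,
# while its fm only has 2 images: that is the illegal access roi_align hits.
```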
What I did was add a line before roi_align, as below:
from torchvision.ops import roi_align
proposal[:, 0] = proposal[:, 0] % fm.size(0) # this make multi-gpu work
roi = roi_align(fm, proposal, output_size=[15, 15])
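The effect of that modulo remap can be checked with plain numbers. A sketch under hypothetical sizes, where fm_size_0 stands for fm.size(0), the number of images one replica holds after scatter:

```python
# Sketch: remap global batch indices to per-replica indices via modulo,
# mirroring `proposal[:, 0] = proposal[:, 0] % fm.size(0)` above.
fm_size_0 = 2  # images held by one replica's fm after scatter

# Second replica's chunk: global batch indices 2 and 3 (hypothetical)
chunk_indices = [2, 2, 3, 3]

remapped = [i % fm_size_0 for i in chunk_indices]
print(remapped)  # -> [0, 0, 1, 1], all valid for a 2-image feature map
```

Note that this relies on the proposals being ordered by image and split evenly across replicas, so the modulo maps each replica's images back into [0, fm.size(0)).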