C++ extension function won't work with distributed parallel training?

Hello, I am using a cpp_extension function written in .cpp and .cu files and built with torch.utils.cpp_extension._get_build_directory. It works fine with one GPU, but with distributed parallel training this error always occurs:

terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL error in: ../torch/lib/c10d/../c10d/NCCLUtils.hpp:155, unhandled cuda error, NCCL version 2.8.3
ncclUnhandledCudaError: Call to CUDA function failed.
Traceback (most recent call last):
  File "train.py", line 557, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "train.py", line 552, in main
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(args, temp_dir), nprocs=args.num_gpus)
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 247, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 205, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 166, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/workspace/train.py", line 402, in subprocess_fn
    training_loop.training_loop(rank=rank, **args)
  File "/workspace/training/training_loop.py", line 289, in training_loop
    loss.accumulate_gradients(phase=phase.name, real_img=real_img, real_c=real_c, gen_z=gen_z, gen_c=gen_c, sync=sync, gain=gain)
  File "/workspace/training/loss.py", line 67, in accumulate_gradients
    gen_img, _gen_ws = self.run_G(gen_z, gen_c, sync=(sync and not do_Gpl)) # May get synced by Gpl.
  File "/workspace/training/loss.py", line 47, in run_G
    img = self.G_synthesis(ws)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 684, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/training/networks.py", line 1091, in forward
    x, num_voxels, sigma, voxel_interact, subs = block(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/training/networks.py", line 990, in forward
    x, intersect_index, min_depth, max_depth = self.fourier_feature(center, next(w_iter), num_voxels, camera_center, ray_directions)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/training/networks.py", line 740, in forward
    intersect_index, min_depth, max_depth = intersect.intersect(self.voxel_size, n_max=self.max_intersect,
  File "/workspace/torch_utils/ops/intersect.py", line 92, in intersect
    return _aabb_intersect_cuda(voxelsize, n_max, points.unsqueeze(0), ray_start, ray_dir)
  File "/workspace/torch_utils/ops/intersect.py", line 49, in forward
    inds, min_depth, max_depth = _plugin.aabb_intersect(
RuntimeError: CUDA error: an illegal memory access was encountered

/opt/conda/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 17 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Here, _plugin.aabb_intersect is the cpp_extension function I implemented.

I'm using PyTorch 1.8.0 with Docker.

Are there any pitfalls in writing C++ extensions that cause errors only during multi-GPU training like this? Thanks!

I guess you are missing the device guard via:

const at::cuda::OptionalCUDAGuard device_guard(device_of(tensor));

Without it, your custom CUDA extension launches its kernels on the default device (GPU 0). On every other rank the input tensors live on a different GPU, so the kernels dereference pointers belonging to another device and you get illegal memory accesses.
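As a minimal sketch of where the guard goes: the binding below is hypothetical (the signature of `aabb_intersect` and the existence of a `aabb_intersect_cuda` launcher are guessed from the traceback, not taken from your code); only the `OptionalCUDAGuard` line is the actual fix.

```cpp
// Hypothetical C++ entry point for the aabb_intersect op.
#include <torch/extension.h>
#include <c10/cuda/CUDAGuard.h>

// Assumed CUDA launcher implemented in the .cu file.
std::tuple<at::Tensor, at::Tensor, at::Tensor> aabb_intersect_cuda(
    float voxelsize, int n_max,
    at::Tensor points, at::Tensor ray_start, at::Tensor ray_dir);

std::tuple<at::Tensor, at::Tensor, at::Tensor> aabb_intersect(
    float voxelsize, int n_max,
    at::Tensor points, at::Tensor ray_start, at::Tensor ray_dir) {
  // Switch the current CUDA device to the one the inputs live on.
  // Without this, kernels launch on the default device (GPU 0); on
  // ranks 1..N-1 the tensor pointers refer to memory on a different
  // GPU, which surfaces as "an illegal memory access was encountered".
  const at::cuda::OptionalCUDAGuard device_guard(device_of(points));
  return aabb_intersect_cuda(voxelsize, n_max, points, ray_start, ray_dir);
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("aabb_intersect", &aabb_intersect, "AABB-ray intersection (CUDA)");
}
```

The guard restores the previous device when it goes out of scope, so it is safe to place at the top of every extension function that receives CUDA tensors.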
