Error: void THCudaTensor_gatherKernel() failed

edowson · March 22, 2019, 4:47am

I’ve adapted the PyTorch DQN tutorial to take its observations from a ROS camera topic, to build an ARDrone DDQN example on the Gazebo 7 simulator, but training crashes.

I’ve inspected the preprocessed image to make sure the rescaling works properly before the image is fed to the model:

[Image: processed_image]

I’ve set the following hyperparameters:

  batch_size: 128
  target_network_update_interval: 2

It throws the following error at line 192:

https://github.com/edowson/alphapilot_openai_ros/blob/master/ardrone_race_track/src/ardrone_v1_start_training_ddqn.py#L192


  File "/project/ros-kinetic-alphapilot/catkin_ws/src/alphapilot_openai_ros/ardrone_race_track/src/ardrone_v1_start_training_ddqn.py", line 423, in <module>
    main()
  File "/project/ros-kinetic-alphapilot/catkin_ws/src/alphapilot_openai_ros/ardrone_race_track/src/ardrone_v1_start_training_ddqn.py", line 398, in main
    agent.train()
  File "/project/ros-kinetic-alphapilot/catkin_ws/src/alphapilot_openai_ros/ardrone_race_track/src/ardrone_v1_start_training_ddqn.py", line 237, in train
    self.optimize_model()
  File "/project/ros-kinetic-alphapilot/catkin_ws/src/alphapilot_openai_ros/ardrone_race_track/src/ardrone_v1_start_training_ddqn.py", line 192, in optimize_model
    next_state_values[non_final_mask] = self.target_net(non_final_next_states).max(1)[0].detach()
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/project/ros-kinetic-alphapilot/catkin_ws/src/alphapilot_openai_ros/ardrone_race_track/src/model/dqn/dqn.py", line 103, in forward
    x = F.relu(self.bn3(self.conv3(x)))
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/conv.py", line 339, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR
[INFO] [1553228217.202863, 5023.352000]: Shutting down node: ardrone_v1_goto_ddqn
/pytorch/aten/src/THC/THCTensorScatterGather.cu:97: void THCudaTensor_gatherKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [0,0,0], thread: [32,0,0] Assertion `indexValue >= 0 && indexValue < src.sizes[dim]` failed.
/pytorch/aten/src/THC/THCTensorScatterGather.cu:97: void THCudaTensor_gatherKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [0,0,0], thread: [33,0,0] Assertion `indexValue >= 0 && indexValue < src.sizes[dim]` failed.

<snip>

/pytorch/aten/src/THC/THCTensorScatterGather.cu:97: void THCudaTensor_gatherKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [0,0,0], thread: [125,0,0] Assertion `indexValue >= 0 && indexValue < src.sizes[dim]` failed.
/pytorch/aten/src/THC/THCTensorScatterGather.cu:97: void THCudaTensor_gatherKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [0,0,0], thread: [127,0,0] Assertion `indexValue >= 0 && indexValue < src.sizes[dim]` failed.

Process finished with exit code 1

I’m using Python 2.7 on Ubuntu 16.04. Here are the library versions:

PYTORCH_VERSION='nightly'
CUDA_VERSION='9.0'
CUDNN_VERSION='7.3.1.20'
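The assertion repeated in the log above, `indexValue >= 0 && indexValue < src.sizes[dim]`, is a bounds check inside the gather kernel: every entry of the index tensor must be a valid position along the gathered dimension. In the PyTorch DQN tutorial, `gather` is called as `policy_net(state_batch).gather(1, action_batch)`; whether the adapted script still does this is an assumption. Below is a minimal sketch of the same check in plain Python, so it can run on the CPU before the GPU call. The helper name `find_out_of_range` and the sample values are hypothetical, not taken from the script.

```python
def find_out_of_range(index_rows, size):
    """Return (row, col, value) for every index that fails 0 <= value < size.

    index_rows: nested list of ints, e.g. action_batch.tolist() (assumed shape [batch, 1])
    size:       length of the gathered dimension, e.g. the number of network outputs
    """
    bad = []
    for r, row in enumerate(index_rows):
        for c, value in enumerate(row):
            if not (0 <= value < size):
                bad.append((r, c, value))
    return bad


# Hypothetical example: a network with 3 outputs, so valid actions are 0, 1, 2.
n_actions = 3
action_batch = [[0], [2], [3], [-1]]

print(find_out_of_range(action_batch, n_actions))
# → [(2, 0, 3), (3, 0, -1)]
```

An empty result means every index would pass the kernel's assertion. A non-empty result lists the offending entries, which under the assumption above would point to a mismatch between the action values stored in replay memory and the number of outputs of the network.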

edowson · March 22, 2019, 8:20am

I also changed the PyTorch version from nightly to release 1.0.1 and cuDNN to 7.4.2.24, but I still get the same error.

CUDA_VERSION='9.0'
CUDNN_VERSION='7.4.2.24'
PYTORCH_VERSION='1.0.1'
TORCHVISION_VERSION='0.2.2'

I also reduced the batch size to 1 and got the same error. The GPU is a Titan V with 12 GB of HBM2 memory, and total GPU utilization is around 27%, so I don’t think an older or overloaded GPU is the problem.

  File "/project/ros-kinetic-alphapilot/catkin_ws/src/alphapilot_openai_ros/ardrone_race_track/src/ardrone_v1_start_training_ddqn.py", line 192, in optimize_model
    next_state_values[non_final_mask] = self.target_net(non_final_next_states).max(1)[0].detach()
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/project/ros-kinetic-alphapilot/catkin_ws/src/alphapilot_openai_ros/ardrone_race_track/src/model/dqn/dqn.py", line 102, in forward
    x = F.relu(self.bn2(self.conv2(x)))
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/functional.py", line 862, in relu
    result = torch.relu(input)
RuntimeError: CUDA error: device-side assert triggered
/pytorch/aten/src/THC/THCTensorScatterGather.cu:97: void THCudaTensor_gatherKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [0,0,0], thread: [0,0,0] Assertion `indexValue >= 0 && indexValue < src.sizes[dim]` failed.
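The two tracebacks in this thread stop at different Python lines (`conv3` in the first run, `relu` after `bn2` in the second), while the device-side assertion comes from the gather kernel both times. That pattern is consistent with CUDA kernels being launched asynchronously, in which case the Python traceback can point at a later operation than the one that actually failed. A minimal sketch of forcing synchronous launches through the `CUDA_LAUNCH_BLOCKING` environment variable follows; placing it at the very top of the training script is an assumption about where it would go.

```python
import os

# Must be set before CUDA is initialised; putting it ahead of `import torch`
# is the conservative choice. Synchronous launches are slower, so this is
# meant for debugging only.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import only after the variable is set
```

With this set, the traceback should stop at the call that triggers the assertion rather than at a later layer. Running the same batch on the CPU is another option, since an out-of-range index there raises an ordinary Python exception at the failing call.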