Device-side assert triggered after masked softmax output

I’ve implemented a reinforcement learning algorithm, A3C, which shares the model parameters between a global model and local models.

A3C samples the action from the softmax output, e.g. policy_head.multinomial(). In my scenario, I need to mask out the actions that are not available in the current state. The mask is a FloatTensor whose elements are either 0.0 or 1.0. The masking code looks like this:

selection_mask = torch.from_numpy((state.observation['screen'][_SCREEN_PLAYER_RELATIVE] == 1).astype('float32'))
selection_mask = Variable(selection_mask.view(1, -1), requires_grad=False).cuda(args['gpu'])
select_spatial_action_prob = select_spatial_action_prob * selection_mask
select_action = select_spatial_action_prob.multinomial()
print("select action:", select_action)
select_entropy = - (log_select_spatial_action_prob * select_spatial_action_prob).sum(1)

After training for a while (around 160K steps), the error below occurred. When I dug into the masked select_spatial_action_prob, I found that every value in that variable was nan, and I don’t know how to fix it.
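A minimal check I’m considering adding right before the multinomial call, just to catch the bad values earlier, would look roughly like this (a sketch only, reusing the variable names from the snippet above; it is not in the linked train.py):

# Sanity check before sampling: NaN is the only value not equal to itself,
# and a fully masked (all-zero) row leaves multinomial() with nothing to sample.
probs = select_spatial_action_prob.data
if (probs != probs).sum() > 0 or probs.sum(1).min() <= 0:
    print("bad probabilities:", probs.min(), probs.max(), probs.sum(1))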

Here is my code for the training part and the model part.
training code: train.py · GitHub
model: model.py · GitHub

The error message is as follows:

Traceback (most recent call last):
File "/home/jielite/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/jielite/anaconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/jielite/pysc2-A3C/sc2/train_hierarchical.py", line 85, in train_master
print("select action:", select_action)
File "/home/jielite/anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 119, in __repr__
return 'Variable containing:' + self.data.__repr__()
File "/home/jielite/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 144, in __repr__
return str(self)
File "/home/jielite/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 151, in __str__
return _tensor_str._str(self)
File "/home/jielite/anaconda3/lib/python3.6/site-packages/torch/_tensor_str.py", line 297, in _str
strt = _matrix_str(self)
File "/home/jielite/anaconda3/lib/python3.6/site-packages/torch/_tensor_str.py", line 216, in _matrix_str
min_sz=5 if not print_full_mat else 0)
File "/home/jielite/anaconda3/lib/python3.6/site-packages/torch/_tensor_str.py", line 79, in _number_format

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/generic/THCTensorCopy.c:70
/opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/THCTensorRandom.cuh:179: void sampleMultinomialOnce(long *, long, int, T *, T *) [with T = float, AccT = float]: block: [0,0,0], thread: [704,0,0] Assertion THCNumerics<T>::ge(val, zero) failed.
/opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/THCTensorRandom.cuh:179: void sampleMultinomialOnce(long *, long, int, T *, T *) [with T = float, AccT = float]: block: [0,0,0], thread: [705,0,0] Assertion THCNumerics<T>::ge(val, zero) failed.
/opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/THCTensorRandom.cuh:179: void sampleMultinomialOnce(long *, long, int, T *, T *) [with T = float, AccT = float]: block: [0,0,0], thread: [706,0,0] Assertion THCNumerics<T>::ge(val, zero) failed.
/opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/THCTensorRandom.cuh:179: void sampleMultinomialOnce(long *, long, int, T *, T *) [with T = float, AccT = float]: block: [0,0,0], thread: [707,0,0] Assertion THCNumerics<T>::ge(val, zero) failed.
/opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/THCTensorRandom.cuh:179: void sampleMultinomialOnce(long *, long, int, T *, T *) [with T = float, AccT = float]: block: [0,0,0], thread: [708,0,0] Assertion THCNumerics<T>::ge(val, zero) failed.

What version of PyTorch are you using? I know there were a few bugs around multinomial in earlier releases that should be fixed on master.

I use PyTorch 0.3.1, installed via conda, with CUDA 9.0.

Hello. I’m facing a similar problem.
Have you found a workaround?
Thanks.

In my case this error came from a division by zero, i.e. the sum of exp(output_i) was 0.
If your error was also caused by a division by zero, you could think about how to avoid it in your setup.
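For the masking case in the original post, one way to avoid that zero denominator is to keep a tiny probability on every allowed action and renormalize before sampling. A rough sketch (same variable names as the original snippet, epsilon chosen arbitrarily, not tested against the posted code):

# Add a small epsilon only on the allowed (mask == 1) actions, then renormalize
masked = select_spatial_action_prob * selection_mask
masked = masked + 1e-8 * selection_mask
masked = masked / masked.sum(1, keepdim=True)   # each row sums to 1 again
select_action = masked.multinomial()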


Hi, I had a similar issue. I didn’t add a ReLU() layer before Softmax(), so sometimes the sum of the preceding FC layer’s output was zero, which caused the error. Hope this helps you get past this annoying bug quickly. :crazy_face:
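Roughly what I mean, as a sketch (hidden_size and num_actions are placeholder sizes, not the names from the original model):

import torch.nn as nn

hidden_size, num_actions = 256, 64          # placeholder sizes for illustration

# Policy head: FC layer, then ReLU so the softmax input is non-negative, then Softmax
policy_head = nn.Sequential(
    nn.Linear(hidden_size, num_actions),
    nn.ReLU(),
    nn.Softmax(dim=1),
)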