Reduce failed to synchronize?

I got this error when computing the BCELoss. Any idea what might be the reason?

Traceback (most recent call last):
  File "main.py", line 255, in <module>
    train(args)
  File "main.py", line 204, in train
    train_loss, train_acc = train_epoch(train_dataloader, model, crit, optimizer, args, reverse_dictionary)
  File "main.py", line 67, in train_epoch
    loss = crit(pred, label)
  File "/share/data/speech/zewei/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/share/data/speech/zewei/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 372, in forward
    size_average=self.size_average)
  File "/share/data/speech/zewei/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/functional.py", line 1179, in binary_cross_entropy
    return torch._C._nn.binary_cross_entropy(input, target, weight, size_average)
RuntimeError: reduce failed to synchronize: device-side assert triggered

And it comes with 128 assertion failures like the following:

/pytorch/torch/lib/THCUNN/BCECriterion.cu:30: Acctype bce_functor<Dtype, Acctype>::operator()(Tuple) [with Tuple = thrust::detail::tuple_of_iterator_references<thrust::device_reference, thrust::device_reference, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, Dtype = float, Acctype = float]: block: [0,0,0], thread: [125,0,0] Assertion input >= 0. && input <= 1. failed.
/pytorch/torch/lib/THCUNN/BCECriterion.cu:30: Acctype bce_functor<Dtype, Acctype>::operator()(Tuple) [with Tuple = thrust::detail::tuple_of_iterator_references<thrust::device_reference, thrust::device_reference, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, Dtype = float, Acctype = float]: block: [0,0,0], thread: [126,0,0] Assertion input >= 0. && input <= 1. failed.
/pytorch/torch/lib/THCUNN/BCECriterion.cu:30: Acctype bce_functor<Dtype, Acctype>::operator()(Tuple) [with Tuple = thrust::detail::tuple_of_iterator_references<thrust::device_reference, thrust::device_reference, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, Dtype = float, Acctype = float]: block: [0,0,0], thread: [127,0,0] Assertion input >= 0. && input <= 1. failed.


Could you check that your input tensor has values that are between 0 and 1?
You could add the following line right before your loss call:
assert (input >= 0. & input <= 1.).all()
and see if it’s ever triggered.


@richard I have tried your solution but it did not work for me, and I got an error like:

assert (input >= 0. & input <= 1.).all()
TypeError: unsupported operand type(s) for &: 'float' and 'builtin_function_or_method'


Hi, I tried your solution but I get the error below:

TypeError: unsupported operand type(s) for &: 'float' and 'Tensor'

Could you give me more advice?
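For what it's worth, the TypeError comes from Python operator precedence: & binds more tightly than the comparisons, so 0. & input is evaluated first (and if input is not bound to your prediction tensor, it refers to Python's builtin input function, which explains the 'builtin_function_or_method' variant). Wrapping each comparison in parentheses avoids this; a small sketch, assuming pred is the tensor passed to the loss:

assert ((pred >= 0.) & (pred <= 1.)).all()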


In my case, this solution did not work, so I just debugged my code line by line and found some value errors. Debugging your code first is the better approach; this solution is only effective in some specific cases.

I got this error, but it was because my data had an Inf in it (being outside the 0-to-1 range was not the real problem in my case; this data is usually higher than 1 and works fine).

do assert (x.data.cpu().numpy() >= 0.).all() and (x.data.cpu().numpy() <= 1.).all()
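If you want to check for Inf or NaN directly, here is a small sketch using numpy (assuming x is the tensor being checked):

import numpy as np
assert np.isfinite(x.data.cpu().numpy()).all()  # fails if x contains Inf or NaN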

I had this problem too, but the suggestions above helped me.
In the end, I had forgotten to pass the output through the sigmoid function (or whatever function you use to get the probability):

prob_predict = torch.nn.Sigmoid()(logits_predict)
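For reference, here is a minimal sketch of the two equivalent setups (the tensor names logits_predict and label are placeholders):

import torch
import torch.nn as nn

logits_predict = torch.randn(8)            # raw model outputs (placeholder)
label = (torch.rand(8) > 0.5).float()      # binary targets in {0, 1} (placeholder)

# Option 1: apply the sigmoid explicitly, then use BCELoss
prob_predict = torch.nn.Sigmoid()(logits_predict)
loss = nn.BCELoss()(prob_predict, label)

# Option 2: BCEWithLogitsLoss applies the sigmoid internally and is more numerically stable
loss = nn.BCEWithLogitsLoss()(logits_predict, label)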

Assuming you have an N * M tensor, you can use the following code as a sanity check:

assert(all([val >= 0 and val <= 1 for row in x.cpu().detach().numpy() for val in row]))

If this assertion fails, then you just need to normalize the output by applying a Sigmoid to x to resolve the problem. Otherwise, there is some other problem.