Can someone please clarify the difference between loss.backward() and
loss.mean().backward()? I am really confused.
The loss is
loss = criterion(output, label)
Where/when should I call loss.backward(), and in what scenario should I call loss.mean().backward()?
Does it have anything to do with the batch size and the number of GPUs being used?
Can it cause a problem, or is it wrong, if I carelessly use one instead of the other?
If your loss is not a scalar value, then you should certainly use either loss.mean() or
loss.sum() to convert it to a scalar before calling backward(). Otherwise, it will cause an error, as in the following example:
>>> loss = criterion(m(input), target)
>>> loss
tensor([1.2377, 0.6949, 0.6477], grad_fn=<BinaryCrossEntropyBackward>)
>>> loss.backward()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/vahid/anaconda3/envs/py37/lib/python3.7/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/vahid/anaconda3/envs/py37/lib/python3.7/site-packages/torch/autograd/__init__.py", line 84, in backward
    grad_tensors = _make_grads(tensors, grad_tensors)
  File "/home/vahid/anaconda3/envs/py37/lib/python3.7/site-packages/torch/autograd/__init__.py", line 28, in _make_grads
    raise RuntimeError("grad can be implicitly created only for scalar outputs")
RuntimeError: grad can be implicitly created only for scalar outputs
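For anyone who wants to reproduce this, here is a minimal sketch. The tiny model `m`, the random `inputs`, and the choice of `nn.BCELoss(reduction="none")` are my own assumptions to get a non-scalar loss, not the setup from the post above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical tiny model and data, chosen only for illustration.
m = nn.Sequential(nn.Linear(4, 1), nn.Sigmoid())
criterion = nn.BCELoss(reduction="none")  # keeps one loss value per sample
inputs = torch.randn(3, 4)
target = torch.tensor([[1.0], [0.0], [1.0]])

loss = criterion(m(inputs), target)  # shape (3, 1), not a scalar

try:
    loss.backward()  # no gradient argument, so autograd needs a scalar
    raised = False
except RuntimeError as err:
    raised = True
    print("backward() on a non-scalar loss failed:", err)

# Recompute the loss and reduce it to a scalar before calling backward().
loss = criterion(m(inputs), target)
loss.mean().backward()
grad_ok = m[0].weight.grad is not None
```

Replacing `.mean()` with `.sum()` also works, it only changes the scale of the gradients.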
As far as I know, most PyTorch loss functions return a scalar by default unless you manually change the reduction argument.
Why do you want to add
.mean() to the loss? If you call a loss function from
torch.nn.functional, it already applies
.mean() by default. For example, if you use
torch.nn.CrossEntropyLoss() as your loss function and pass it two tensors, your model's output and the corresponding target, then the loss is a tensor with only one element, like this:
tensor(1.6975, device='cuda:1', grad_fn=<NllLoss2DBackward>)
And you can do backpropagation in your network normally.
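A quick way to check this yourself is sketched below. The random `output` and `label` tensors are made up for illustration. It relies on the default `reduction="mean"` of `nn.CrossEntropyLoss`, so the default loss should match the mean of the per-sample losses.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Made-up logits for 5 samples and 3 classes, plus class labels.
output = torch.randn(5, 3, requires_grad=True)
label = torch.tensor([0, 2, 1, 1, 0])

loss = nn.CrossEntropyLoss()(output, label)                        # default reduction
per_sample = nn.CrossEntropyLoss(reduction="none")(output, label)  # shape (5,)

is_scalar = loss.dim() == 0
same_as_mean = torch.allclose(loss, per_sample.mean())
# Calling .mean() on a tensor that is already a scalar changes nothing.
mean_is_noop = torch.equal(loss.mean(), loss)

loss.backward()  # works directly because the loss is a scalar
```

So with the default reduction, loss.backward() and loss.mean().backward() give the same gradients.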
Yes, that’s true. The important thing is that calling
loss.backward() without arguments only works on scalar values. Using
reduction='mean' or reduction='sum' takes care of this. But sometimes, one may need to use
reduction='none' in order to apply different reductions along different dimensions, for example, taking the mean over the batch but the sum over pixels.
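The mixed reduction mentioned above could look roughly like this. The tensor shapes and the use of `nn.MSELoss` are assumptions for the sake of the sketch; any element-wise loss with `reduction="none"` would follow the same pattern.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Made-up batch of 4 single-channel 8x8 "images".
pred = torch.randn(4, 1, 8, 8, requires_grad=True)
target = torch.randn(4, 1, 8, 8)

per_pixel = nn.MSELoss(reduction="none")(pred, target)  # same shape as pred
per_image = per_pixel.flatten(start_dim=1).sum(dim=1)   # sum over pixels, shape (4,)
loss = per_image.mean()                                 # mean over the batch, scalar

loss.backward()

# Sanity check: same value as summing everything and dividing by the batch size.
reference = nn.MSELoss(reduction="sum")(pred, target) / pred.shape[0]
matches_reference = torch.allclose(loss, reference)
```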
Yes, I agree, but all of this holds for the single-GPU case.
What about when we are using multiple GPUs? Should we stick to loss.backward() if our loss is a scalar, or should we use loss.mean().backward()?
It does not matter how many GPUs you use. If
loss is already a scalar, then you can just call
loss.backward(), but if it is not a scalar, then you can convert it to a scalar and then call backward:
loss.mean().backward(). The number of GPUs does not play any role in how to call the backward function.
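On the earlier question of whether picking one reduction carelessly can cause a problem: both .mean() and .sum() give a valid scalar, but the gradients differ by a constant factor (the number of loss elements), which effectively rescales the learning rate. A minimal sketch, with a hypothetical linear model and random data:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical model and data, only to compare gradient scales.
model = nn.Linear(4, 1)
x = torch.randn(6, 4)
y = torch.randn(6, 1)


def per_sample_loss():
    # One loss value per sample, shape (6, 1).
    return nn.MSELoss(reduction="none")(model(x), y)


model.zero_grad()
per_sample_loss().mean().backward()
grad_mean = model.weight.grad.clone()

model.zero_grad()
per_sample_loss().sum().backward()
grad_sum = model.weight.grad.clone()

n = per_sample_loss().numel()  # number of elements being reduced
scaled_match = torch.allclose(grad_sum, n * grad_mean, atol=1e-6)
```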
Just FYI, we can also use
loss.mean().backward() whenever we define the loss inside our model.
This is usually done when we have multiple GPUs and don't want one of the GPUs to be unbalanced in terms of memory.
More can be found here
Oh, interesting! So if
loss is defined inside the model, then
loss can have independent values across different GPUs, and in that case we would need to add them up (or take the mean) to convert them to a scalar value, and then call backward().
Your question is exactly the solution to my question.
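For future readers, here is a rough CPU-only sketch of the "loss inside the model" case. It assumes nn.DataParallel-style behaviour, where each replica returns its own scalar loss and the gathered result has one value per GPU. The class `ModelWithLoss` is hypothetical, and the two replicas are simulated by splitting the batch by hand, so this only illustrates the shape issue and is not a real multi-GPU run.

```python
import torch
import torch.nn as nn


class ModelWithLoss(nn.Module):
    """Hypothetical module that computes the loss inside forward()."""

    def __init__(self):
        super().__init__()
        self.net = nn.Linear(4, 3)
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, x, label):
        return self.criterion(self.net(x), label)  # scalar per call


torch.manual_seed(0)
model = ModelWithLoss()
x = torch.randn(8, 4)
label = torch.randint(0, 3, (8,))

# Stand-in for two GPU replicas: each one sees half of the batch.
chunks = zip(x.chunk(2), label.chunk(2))
per_replica = torch.stack([model(xc, lc) for xc, lc in chunks])  # shape (2,)

# One value per replica is not a scalar, so reduce before backward().
loss = per_replica.mean()
loss.backward()

# With equal chunk sizes this matches the loss on the full batch.
matches_full_batch = torch.allclose(loss, model(x, label))
```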