I guess that in the first epochs the output logits were not yet saturated, so torch.sigmoid didn't return an exact zero or one.
Replacing the invalid values after they were computed won't avoid the invalid gradients:
import torch

predicted = torch.zeros(1, requires_grad=True)
term_a = torch.log(predicted)        # log(0) = -inf
term_a[torch.isinf(term_a)] = -100.  # overwrite the -inf after the fact
term_a.backward()
print(predicted.grad)
> tensor([nan])

The in-place assignment zeroes the gradient flowing to the overwritten positions, but log's backward still multiplies that zero by 1/0 = inf, and 0 * inf = nan.
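A sketch of two ways to avoid the nan (the eps value of 1e-6 is my arbitrary choice): clamp the probabilities before the log so the inf never enters the backward chain, or, for binary cross entropy, skip sigmoid + log entirely and feed the raw logits to the numerically stable logits-based loss:

```python
import torch
import torch.nn.functional as F

# Option 1: clamp before the log. For inputs below the clamp minimum,
# clamp's backward returns 0, so 1/x = inf is never multiplied in.
predicted = torch.zeros(1, requires_grad=True)
term_a = torch.log(predicted.clamp(min=1e-6))
term_a.backward()
print(predicted.grad)  # finite: tensor([0.]), since 0 is below the clamp min

# Option 2: use the logits-based loss, which computes log(sigmoid(x))
# internally in a numerically stable way.
logits = torch.tensor([100.], requires_grad=True)  # saturates sigmoid to 1.0
target = torch.zeros(1)
loss = F.binary_cross_entropy_with_logits(logits, target)
loss.backward()
print(logits.grad)  # finite, no nan even though sigmoid(100) rounds to 1.0
```

Option 2 is generally the preferred fix when the log comes from a sigmoid, since it needs no hand-tuned eps.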