“For one, if either $y_n = 0$ or $(1 - y_n) = 0$, then we would be multiplying 0 with infinity. Secondly, if we have an infinite loss value, then we would also have an infinite term in our gradient, since $\lim_{x\to 0} \frac{d}{dx} \log (x) = \infty$. This would make BCELoss’s backward method nonlinear with respect to $x_n$, and using it for things like linear regression would not be straight-forward.”

My question is about the last sentence: how does a loss term and its gradient going to infinity translate into a nonlinear backward method and its incompatibility with linear regression?

I’ve noticed that sentence too. I think it’s pretty much nonsense, where
either the author was just sloppy or didn’t understand what the words
he was using meant.

The best I can come up with is that he wanted to say something like:

“When BCELoss becomes inf, its backward() method – and the
whole backpropagation – will become polluted with infs and nans,
causing training to fail. We therefore clamp BCELoss's internal log()
function at -100 to save you from such an ignoble fate.”
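
To make that concrete, here is a rough sketch (my own illustration, not anything taken from the docs) comparing a hand-written, unclamped BCE against `torch.nn.functional.binary_cross_entropy` on predictions that are exactly 0 or 1. The tensor values and the `naive_bce` helper are made up for the example.

```python
import torch
import torch.nn.functional as F


def naive_bce(x, y):
    # Hypothetical helper: textbook BCE with no clamping of log().
    return -(y * torch.log(x) + (1 - y) * torch.log(1 - x)).mean()


# Predictions that are confidently wrong: x = 0 where y = 1, x = 1 where y = 0.
y = torch.tensor([1.0, 0.0])

x_naive = torch.tensor([0.0, 1.0], requires_grad=True)
naive_loss = naive_bce(x_naive, y)
naive_loss.backward()
print(naive_loss)    # inf
print(x_naive.grad)  # infinite entries, so an optimizer step would be ruined

# The "0 times infinity" case from the quote: y = 0 multiplies log(0) = -inf.
zero_times_inf = naive_bce(torch.tensor([0.0]), torch.tensor([0.0]))
print(zero_times_inf)  # nan

# Built-in BCE clamps its log outputs at -100, so the loss stays finite.
x_clamped = torch.tensor([0.0, 1.0], requires_grad=True)
clamped_loss = F.binary_cross_entropy(x_clamped, y)
clamped_loss.backward()
print(clamped_loss)    # 100.0
print(x_clamped.grad)  # large but finite
```

The built-in gradient still comes out very large, but it is finite, so the rest of backpropagation is not flooded with infs and nans.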

Yes, exactly. Your translation of the author’s words makes perfect sense, and it’s also what I would have expected to be written there. Those words about the nonlinearity of backprop just made me feel there’s a whole other dimension of backward loss propagation that I have no clue about. Relieved to know I’m not the only one.