Question about CTC gradient

In PyTorch, the CTC loss takes log-probabilities as input (i.e. the output of a log_softmax operation).

The CTC gradient formula in PyTorch (CPU version, for simplicity) refers to, and appears to implement, eq. 16 from the original CTC paper.

Eq. 16 seems to compute gradients w.r.t. logits, not log_probs, whereas PyTorch must compute gradients w.r.t. log_probs, so it’s strange if they use the same equation.

This mismatch confuses me. I must be missing something obvious, because gradcheck passes, so the formula in PyTorch must be correct…

At the risk of being the CTC person that I don’t want to be:

The key is that the paper is not working in log space, whereas here we are not interested in the derivative w.r.t. y, but w.r.t. log y. With this, we can remind ourselves that log probs are themselves logits (i.e. x.log_softmax(dim=1) == x.log_softmax(dim=1).log_softmax(dim=1)), so the derivative w.r.t. log y will be the one w.r.t. u.
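To make the "log probs are logits" remark concrete, here is a minimal sketch that checks the idempotence of log_softmax numerically. The tensor shape and seed are arbitrary choices for illustration, not anything taken from the PyTorch source.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 6, dtype=torch.double)  # arbitrary logits: 4 rows, 6 classes

log_probs = x.log_softmax(dim=1)
twice = log_probs.log_softmax(dim=1)

# exp(log_probs) sums to 1 along dim=1, so its logsumexp is 0 and the
# second log_softmax has nothing left to subtract.
row_logsumexp = torch.logsumexp(log_probs, dim=1)
print(row_logsumexp)
print(torch.allclose(log_probs, twice))
```

In other words, a valid log-probability vector is a fixed point of log_softmax, which is why it can be treated as a (particular) vector of logits.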

The worst thing that happens is that you project the gradient onto the tangent space of the space (not a manifold, but close) of log probs twice (once in the CTC loss backward and then again in the log_softmax backward). Is there a speedup to be had? In theory, yes, because the backward of the log_softmax you’ll be doing in PyTorch will compute a mean that we already know is zero. But would we save anything by skipping the projection in the CTC backward instead? I haven’t thought about it much, but I expect not much, as we would need to specify CTC to take logits and take the log_softmax in the CTC forward (but then we wouldn’t need the log_softmax backward).
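The "projected twice" point can be checked numerically. The sketch below assumes the CPU path in double precision with small, arbitrary sizes: it captures the gradient that the CTC backward hands to log_probs and compares it with the gradient that arrives at the logits after the log_softmax backward.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, N, C, S = 12, 2, 5, 4  # time steps, batch size, classes (blank = 0), target length

logits = torch.randn(T, N, C, dtype=torch.double, requires_grad=True)
targets = torch.randint(1, C, (N, S))  # labels 1..C-1, never the blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

log_probs = logits.log_softmax(dim=2)
log_probs.retain_grad()  # keep the gradient that the CTC backward produces
loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, reduction="sum")
loss.backward()

grad_log_probs = log_probs.grad  # output of the CTC backward
grad_logits = logits.grad        # after the log_softmax backward as well

# The CTC backward already sums to zero over the class dimension ...
print(grad_log_probs.sum(dim=2).abs().max())
# ... so the log_softmax backward leaves it essentially unchanged.
print(torch.allclose(grad_log_probs, grad_logits, atol=1e-8))
```

If the input to ctc_loss is not normalized (i.e. not the output of a log_softmax), this equality should not be expected to hold.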

Best regards



@tom Thank you very much! Now this complex situation is clearer!

I also find that the CTC backward is not consistent with the forward. It was rather annoying to discover that my network could not converge just because I did not pass in the log-probabilities directly from the log_softmax function. I’d like to share a workaround in case someone needs the “true” gradient: CTCLoss gradient is incorrect · Issue #52241 · pytorch/pytorch · GitHub