Equivalence between CTC loss gradients

I am working on a variant of the CTC loss. It is my understanding that one can compute the forward variables (in log space) and simply perform a logsumexp over the last two alpha variables at the last timestep. This gives the log-likelihood of the sequence under the model.

My current implementation is in PyTorch; it takes this as the loss and simply calls backward() to get the gradients, i.e. I don't have a dedicated implementation of the backward pass.
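
For concreteness, here is a minimal sketch of that approach for a single sequence (the helper name `ctc_forward_loss` and the shapes are my own, not from any library): the log-space alpha recursion, a logsumexp over the last two alphas at the final timestep, and gradients obtained purely through `loss.backward()`. A large finite negative constant is used in place of `float('-inf')`, which touches on the stability point that comes up later in this thread.

```python
import torch

NEG = -1e10  # finite stand-in for log(0); see the stability discussion below

def ctc_forward_loss(log_probs, target, blank=0):
    """log_probs: (T, C) log-softmax outputs for a single sequence.
    target: (S,) label indices, not containing the blank.
    Returns the CTC negative log-likelihood of `target`."""
    T, C = log_probs.shape
    # extended target with blanks: blank, l1, blank, l2, ..., blank (length 2S + 1)
    ext = torch.full((2 * target.numel() + 1,), blank, dtype=torch.long)
    ext[1::2] = target
    S = ext.numel()

    pad = log_probs.new_full((1,), NEG)
    # t = 0: only the leading blank and the first label can start a path
    alpha = log_probs.new_full((S,), NEG)
    alpha[0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[1] = log_probs[0, ext[1]]

    # the "skip" transition (s-2 -> s) is only allowed onto a non-blank label
    # that differs from the label two positions earlier
    skip = torch.zeros(S, dtype=torch.bool)
    skip[2:] = (ext[2:] != blank) & (ext[2:] != ext[:-2])

    for t in range(1, T):
        stay = alpha                              # s   -> s
        step = torch.cat([pad, alpha[:-1]])       # s-1 -> s
        jump = torch.cat([pad, pad, alpha[:-2]])  # s-2 -> s
        jump = torch.where(skip, jump, pad)
        alpha = torch.logsumexp(torch.stack([stay, step, jump]), dim=0) + log_probs[t, ext]

    # log p(target | input): logsumexp over the last two alpha entries
    return -torch.logsumexp(alpha[-2:], dim=0)

# Usage: the gradient comes entirely from autograd, no hand-written backward.
T, C, S = 50, 20, 10
logits = torch.randn(T, C, requires_grad=True)
loss = ctc_forward_loss(torch.log_softmax(logits, dim=-1), torch.randint(1, C, (S,)))
loss.backward()
```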

Looking at the native CUDA implementation here: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/LossCTC.cu#L9, the full forward-backward recursion from the paper is performed.

The question is: will the gradients from the native implementation be equivalent to those obtained by computing only the forward pass and letting autograd differentiate it, just (much) slower? Or is there something in the explicit backward recursion that produces different gradients? If so, what would that be?
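
One way to probe this empirically (a sketch, assuming the hypothetical `ctc_forward_loss` helper from above is in scope, and using double precision so the comparison is meaningful): evaluate both losses on the same logits and compare the gradients.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, C, S = 50, 20, 10
logits = torch.randn(T, C, dtype=torch.double, requires_grad=True)
target = torch.randint(1, C, (S,))  # labels exclude the blank (index 0)

# native CTC loss: its backward pass runs the explicit forward-backward recursion
log_probs = torch.log_softmax(logits, dim=-1)
native = F.ctc_loss(log_probs.unsqueeze(1), target.unsqueeze(0),
                    input_lengths=torch.tensor([T]),
                    target_lengths=torch.tensor([S]),
                    blank=0, reduction='sum')
g_native, = torch.autograd.grad(native, logits)

# pure-autograd version: only the forward recursion is written by hand
custom = ctc_forward_loss(torch.log_softmax(logits, dim=-1), target)
g_custom, = torch.autograd.grad(custom, logits)

# if the two backwards are equivalent, this difference should be tiny
print((g_native - g_custom).abs().max())
```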

I already talked with @tom about this, but it wasn't clear to me whether a pure PyTorch implementation of the forward pass followed by loss.backward() would suffice.

My impression was that the implied CTC backward has numerical stability problems, but in the end I didn't dig into it too much.
You can just implement the CTC function, call it with double precision inputs and run gradcheck if you're interested in exploring this.
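
For example (a sketch of that suggestion, again assuming the `ctc_forward_loss` helper from above; the tolerances are only illustrative):

```python
import torch
from torch.autograd import gradcheck

torch.manual_seed(0)
T, C, S = 15, 6, 4
# double precision matters: gradcheck's finite differences are too noisy in float32
log_probs = torch.randn(T, C, dtype=torch.double).log_softmax(-1).requires_grad_()
target = torch.randint(1, C, (S,))

# gradcheck perturbs log_probs numerically and compares against the autograd gradient
print(gradcheck(lambda lp: ctc_forward_loss(lp, target), (log_probs,), eps=1e-6, atol=1e-4))
```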

Best regards

Thomas


Ah, gradcheck sounds like a good idea! Thanks!

@Kiriakos_Shiarli I implemented a CTC that computes gradients via pure autograd (it’s quite slow):

If you replace logadd with PyTorch's logsumexp and use 1e-16 instead of float('-inf'), double backward may just happen to work.
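
Roughly what that might look like (a sketch; the `logadd` helper and the padding constant are illustrative, and the exact value that stands in for float('-inf') depends on the implementation being patched; here it is interpreted as a log-domain padding of math.log(1e-16)):

```python
import math
import torch

# finite stand-in for log(0): the post suggests 1e-16 rather than a hard -inf,
# interpreted here as a log-domain padding value (an assumption on my part)
NEG_LOG_ZERO = math.log(1e-16)

def logadd(*terms):
    """log(exp(a) + exp(b) + ...) computed via torch.logsumexp."""
    return torch.logsumexp(torch.stack(torch.broadcast_tensors(*terms)), dim=0)

# usage: combine two log-space terms and take a gradient of the gradient,
# which is the "double backward" the post refers to
a = torch.full((5,), NEG_LOG_ZERO, dtype=torch.double, requires_grad=True)
b = torch.randn(5, dtype=torch.double, requires_grad=True)
out = logadd(a, b).sum()
grad_b, = torch.autograd.grad(out, b, create_graph=True)
grad_b.sum().backward()  # second-order gradients flow into a.grad and b.grad
```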