Softmax/log_softmax in CTC loss

The docs suggest using logarithmized probabilities (i.e., log_softmax output) as the input to CTCLoss:
https://pytorch.org/docs/stable/generated/torch.nn.CTCLoss.html#torch.nn.CTCLoss

However, there is an example in the PyTorch repo (the DeepSpeech model) where a softmax (instead of log_softmax) is applied only at evaluation time, not during training.

So I want to clarify what I should use for training and evaluation with CTCLoss:

  • softmax/log_softmax for both training and evaluation?
  • identity (raw logits) for training and softmax/log_softmax for evaluation, as in the example I shared above?

As far as I know, training requires log_softmax. For inference you can simply take the argmax, but argmax alone only gives you the Top-1 prediction; if you instead apply softmax and keep the top 5 scores, you can also measure Top-5 accuracy. I think the DeepSpeech model does something similar.
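
Here is a minimal sketch of that split, not the DeepSpeech code itself: log_softmax feeding CTCLoss during training, and argmax or softmax+topk at evaluation time (shapes and the blank index are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

T, N, C = 50, 4, 20                      # input length, batch size, classes (blank = 0)
logits = torch.randn(T, N, C, requires_grad=True)   # stand-in for raw model outputs

# --- training: CTCLoss expects log-probabilities ---
log_probs = F.log_softmax(logits, dim=-1)
targets = torch.randint(1, C, (N, 10), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc = torch.nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()

# --- evaluation: argmax is enough for greedy (Top-1) decoding ---
greedy = logits.argmax(dim=-1)           # (T, N) best class per frame

# softmax gives normalized scores, e.g. for Top-5 inspection
probs = F.softmax(logits, dim=-1)
top5_scores, top5_ids = probs.topk(5, dim=-1)
```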

Thank you for the reply.

So for training I need to use log_softmax; that’s clear now. For inference I can use softmax to get the top-k scores.

What still isn’t clear is why the DeepSpeech implementation in the repo does not use log_softmax. I suppose there should be an explicit call to log_softmax either in the model definition or where the model is called, right? Or did I miss something?

  • model definition
  • model calling
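
For concreteness, here is a hedged sketch of the two places that explicit call could live, assuming a model that returns raw logits (the model and its names are illustrative, not taken from the DeepSpeech repo):

```python
import torch
import torch.nn.functional as F

class TinyAcousticModel(torch.nn.Module):
    """Toy stand-in for an acoustic model that emits raw logits."""
    def __init__(self, feat_dim=80, num_classes=29):
        super().__init__()
        self.rnn = torch.nn.GRU(feat_dim, 128)
        self.fc = torch.nn.Linear(128, num_classes)

    def forward(self, x):                # x: (T, N, feat_dim)
        out, _ = self.rnn(x)
        return self.fc(out)              # raw logits, (T, N, C)
        # Option 1 ("model definition"): bake it into forward() instead:
        # return F.log_softmax(self.fc(out), dim=-1)

model = TinyAcousticModel()
x = torch.randn(50, 4, 80)

# Option 2 ("model calling"): keep the model returning logits and
# apply log_softmax where the loss is computed:
log_probs = F.log_softmax(model(x), dim=-1)
```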