CTCDecoder returns result.tokens longer than input emissions - why +2 extra tokens?

Problem Description

I’m using torchaudio.models.decoder.CTCDecoder (CPU version with flashlight backend) and noticed that the decoder returns a result.tokens list that is consistently two elements longer than the input emissions sequence length.

This causes issues when using beam_result.timesteps as indices into the original logits, because some timestep values exceed the valid range.

Code Example

import torch
from torchaudio.models.decoder import ctc_decoder

# Build the decoder via the ctc_decoder() factory (it returns a CTCDecoder);
# vocab_list must include the blank and silence tokens
decoder = ctc_decoder(
    lexicon=None,
    tokens=vocab_list,
    blank_token="<blank>",
    beam_size=5,
)

# Input emissions with shape (batch=1, time=128, num_tokens=153)
emissions = torch.randn(1, 128, 153, dtype=torch.float32)

# Decode
results = decoder(emissions)
beam_result = results[0][0]

print(f"emissions.size(1) = {emissions.size(1)}")  # 128
print(f"len(beam_result.tokens) = {len(beam_result.tokens)}")  # 130 (!!)
print(f"beam_result.timesteps = {beam_result.timesteps}")  # Contains [128, 129]

Observed Behavior

Input:

  • emissions.shape = (1, 128, 153) — 128 time steps

Output:

  • len(beam_result.tokens) = 130 — 130 tokens (2 extra!)
  • beam_result.timesteps contains values [..., 128, 129]

The Issue

Looking at the source code in _get_timesteps:

def _get_timesteps(self, idxs: torch.IntTensor) -> torch.IntTensor:
    """Returns frame numbers corresponding to non-blank tokens."""
    timesteps = []
    for i, idx in enumerate(idxs):
        if idx == self.blank:
            continue
        if i == 0 or idx != idxs[i - 1]:
            timesteps.append(i)  # ← Returns position in result.tokens array
    return torch.IntTensor(timesteps)

The method returns indices within result.tokens (which has length 130), not frame numbers in the original emissions (which has length 128).
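To see concretely that the method indexes into the token array rather than the frame axis, here is a standalone replica of the loop run on a toy sequence (blank index 0 and the token values are assumptions for the example):

```python
import torch

# Standalone replica of _get_timesteps, to illustrate what it returns
def get_timesteps(idxs: torch.IntTensor, blank: int = 0) -> torch.IntTensor:
    """Collects the position of the first occurrence of each non-blank run."""
    timesteps = []
    for i, idx in enumerate(idxs):
        if idx == blank:
            continue
        if i == 0 or idx != idxs[i - 1]:
            timesteps.append(i)
    return torch.IntTensor(timesteps)

# Toy token sequence: what comes back are positions in this array
idxs = torch.IntTensor([3, 5, 5, 0, 7])
print(get_timesteps(idxs).tolist())  # [0, 1, 4]
```

Note that the repeated 5 and the blank are skipped, and the values returned are array positions — so if the array itself is longer than the number of frames, the positions can exceed the frame range.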

When beam_result.timesteps contains values like [128, 129], these are out of bounds for indexing into emissions[:, 0:128, :].

Questions

  1. Why does result.tokens have length 130 when emissions has only 128 frames? Where do the 2 extra tokens come from?

  2. Is this expected behavior from the flashlight decoder backend?

  3. How should I correctly map beam_result.timesteps to actual frame indices in the original emissions tensor?

Additional Context

  • This happens consistently: len(beam_result.tokens) = emissions.size(1) + 2
  • For larger inputs (e.g., 257 frames), I get 259 tokens
  • Related source code:

Environment

  • torchaudio: 2.0.0+
  • torch: 2.0.0+
  • Platform: CPU (flashlight backend)

The extra two tokens are start and end tokens

So when the flashlight decoder returns result.tokens with length 130 (128 original frames + 2 start/end tokens), the _get_timesteps() method uses these indices directly, without filtering out the extra positions:

def _get_timesteps(self, idxs: torch.IntTensor) -> torch.IntTensor:
    """Returns frame numbers corresponding to non-blank tokens."""
    timesteps = []
    for i, idx in enumerate(idxs):
        if idx == self.blank:
            continue
        if i == 0 or idx != idxs[i - 1]:
            timesteps.append(i)  # <- appends 128, 129 for start/end tokens
    return torch.IntTensor(timesteps)

So if the start token is at position 128 and the end token at position 129, these indices get added to the timesteps output. But these indices (128, 129) are outside the bounds of the original emissions tensor, which has shape [batch, 128, num_tokens] (valid indices: 0-127).

What is the correct way to handle timesteps when flashlight returns these extra tokens?
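Until the semantics of the extra tokens are confirmed, one defensive workaround is simply to mask out any index that falls outside the emission range before using it. A sketch (the timestep values are hypothetical decoder output; num_frames stands for emissions.size(1)):

```python
import torch

num_frames = 128  # emissions.size(1) in the example above
timesteps = torch.IntTensor([0, 1, 5, 42, 128, 129])  # hypothetical decoder output

# Keep only indices that are valid frame positions in the emissions tensor
valid = timesteps[(timesteps >= 0) & (timesteps < num_frames)]
print(valid.tolist())  # [0, 1, 5, 42]
```

This guarantees safe indexing into emissions[:, 0:num_frames, :], though it silently discards the out-of-range entries rather than explaining them.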

I believe the start token is at position 0 and the end token at position 129, since their purpose is to mark the start and end of the sequence.

Yes, that’s correct. Thank you. The sil token is present at these positions. Source - text/flashlight/lib/text/decoder/LexiconFreeDecoder.cpp at main · flashlight/text · GitHub