Problem Description
I’m using torchaudio.models.decoder.CTCDecoder (the CPU decoder backed by flashlight) and noticed that the decoder returns a result.tokens sequence that is consistently 2 elements longer than the input emissions sequence length (emissions.size(1)).
This causes problems when using beam_result.timesteps as indices into the original logits, because some timestep values exceed the valid frame range.
Code Example
import torch
from torchaudio.models.decoder import ctc_decoder

# Create decoder (CTCDecoder instances are built via the ctc_decoder factory;
# vocab_list is my token list and includes "<blank>")
decoder = ctc_decoder(
    lexicon=None,
    tokens=vocab_list,
    blank_token="<blank>",
    beam_size=5,
)
# Input emissions with shape (batch=1, time=128, num_tokens=153)
emissions = torch.randn(1, 128, 153, dtype=torch.float32)
# Decode
results = decoder(emissions)
beam_result = results[0][0]
print(f"emissions.size(1) = {emissions.size(1)}") # 128
print(f"len(beam_result.tokens) = {len(beam_result.tokens)}") # 130 (!!)
print(f"beam_result.timesteps = {beam_result.timesteps}") # Contains [128, 129]
Observed Behavior
Input:
emissions.shape = (1, 128, 153) (128 time steps)
Output:
len(beam_result.tokens) = 130 (2 extra tokens!)
beam_result.timesteps contains the values [..., 128, 129]
The Issue
Looking at the source code in _get_timesteps:
def _get_timesteps(self, idxs: torch.IntTensor) -> torch.IntTensor:
    """Returns frame numbers corresponding to non-blank tokens."""
    timesteps = []
    for i, idx in enumerate(idxs):
        if idx == self.blank:
            continue
        if i == 0 or idx != idxs[i - 1]:
            timesteps.append(i)  # <- returns the position in the tokens array
    return torch.IntTensor(timesteps)
The method returns indices into result.tokens (which has length 130), not frame numbers in the original emissions (which has 128 frames).
When beam_result.timesteps contains values like 128 or 129, these are out of bounds for indexing into emissions[:, 0:128, :].
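To make the mismatch concrete, here is a standalone copy of the _get_timesteps logic (with a hypothetical blank index of 0, not the decoder's actual configuration) applied to a toy token sequence. The returned values index into the idxs array itself, so if idxs is longer than the number of frames, out-of-range "timesteps" can appear:

```python
import torch

def get_timesteps(idxs, blank=0):
    # Same logic as CTCDecoder._get_timesteps: for each non-blank token
    # that starts a new run, record its position *within idxs*.
    timesteps = []
    for i, idx in enumerate(idxs):
        if idx == blank:
            continue
        if i == 0 or idx != idxs[i - 1]:
            timesteps.append(i)
    return torch.IntTensor(timesteps)

# Toy 6-element token sequence; the recorded positions are array indices,
# not frame numbers.
idxs = torch.IntTensor([0, 3, 3, 0, 5, 5])
print(get_timesteps(idxs))  # tensor([1, 4], dtype=torch.int32)
```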
Questions
- Why does result.tokens have length 130 when the emissions have only 128 frames? Where do the 2 extra tokens come from?
- Is this expected behavior of the flashlight decoder backend?
- How should I correctly map beam_result.timesteps to actual frame indices in the original emissions tensor?
Additional Context
- This happens consistently: len(result.tokens) = emissions.size(1) + 2
- For larger inputs (e.g., 257 frames), I get 259 tokens
- Related source code: the __call__ method where decoder.decode() is called, and the _to_hypo method that processes the results
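For now I work around this by masking out any out-of-range timesteps before indexing into the emissions. A minimal sketch (assuming the trailing entries are decoder-internal padding that is safe to drop, which is exactly what I'd like confirmed):

```python
import torch

num_frames = 128  # emissions.size(1)
# Example of what beam_result.timesteps might contain
timesteps = torch.IntTensor([0, 17, 42, 128, 129])

# Keep only timesteps that fall inside the emissions window
valid = timesteps[timesteps < num_frames]
print(valid)  # tensor([ 0, 17, 42], dtype=torch.int32)
```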
Environment
- torchaudio: 2.0.0+
- torch: 2.0.0+
- Platform: CPU (flashlight backend)