Fixes #4181
This fixes the shared-memory layout used by first_matrix__bitonic_topk_kernel.
block_topk_key is a float*, so pointer arithmetic is in units of float, not bytes. The previous code computed block_topk_key + sizeof(float) * beam, which advanced the value buffer by sizeof(float) * beam float elements instead of by beam float elements.
The launch allocates the result region as:
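beam * sizeof(float) + beam * sizeof(int)
With 4-byte float and int, that region is 8 * beam bytes, while the old offset placed the value buffer 16 * beam bytes past block_topk_key, beyond the end of the allocation. The value buffer should start beam float elements after block_topk_key.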
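To make the arithmetic concrete, here is a small host-side sketch that compares the two offsets against the allocated size. The helper names (buggy_value_offset_bytes, fixed_value_offset_bytes, result_region_bytes, values_fit) are hypothetical and do not appear in the torchaudio source. The sketch assumes the keys sit at the start of the result region and that float and int are both 4 bytes, as they are in CUDA.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: 4-byte float and int, as on CUDA targets.
static_assert(sizeof(float) == 4 && sizeof(int) == 4, "sketch assumes 4-byte float/int");

// Size of the result region, as stated in the launch: beam keys + beam values.
std::size_t result_region_bytes(std::size_t beam) {
    return beam * sizeof(float) + beam * sizeof(int);
}

// Old code: block_topk_key + sizeof(float) * beam.
// Adding N to a float* moves it N * sizeof(float) bytes, so the byte
// offset is sizeof(float) * (sizeof(float) * beam).
std::size_t buggy_value_offset_bytes(std::size_t beam) {
    return sizeof(float) * (sizeof(float) * beam);
}

// Fixed code: block_topk_key + beam, i.e. beam float elements.
std::size_t fixed_value_offset_bytes(std::size_t beam) {
    return sizeof(float) * beam;
}

// True if beam ints starting at offset_bytes stay inside the result region
// (hypothetical check; assumes the keys start at byte 0 of the region).
bool values_fit(std::size_t offset_bytes, std::size_t beam) {
    return offset_bytes + beam * sizeof(int) <= result_region_bytes(beam);
}
```

For example, with beam = 10 the region is 80 bytes. The fixed offset is 40 bytes, so the 40 bytes of values end exactly at the region boundary, while the old offset is 160 bytes and already lies past the end before any value is written.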
PR: Fix CUCTC top-k shared memory offset by lanarkite99 (Pull Request #4198, pytorch/audio)
Since the torchaudio repo is no longer actively monitored, I wanted to share the PR here for visibility.
Thanks