Small PR for fixing CUCTC top-k shared memory offset #4198

Fixes #4181

This fixes the shared-memory layout used by first_matrix__bitonic_topk_kernel.

block_topk_key is a float*, so pointer arithmetic on it is in units of float, not bytes. The previous code computed block_topk_key + sizeof(float) * beam, which placed the start of the value buffer sizeof(float) * beam float elements past block_topk_key instead of beam float elements, i.e. four times too far with a 4-byte float.

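The launch allocates the result region as:

beam * sizeof(float) + beam * sizeof(int)

With 4-byte float and int that is 8 * beam bytes, while the old offset put the value buffer 16 * beam bytes past block_topk_key, more than the size of the entire region.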
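The mismatch can be checked with plain host-side arithmetic. Below is a minimal C++ sketch, not code from the PR: the helper names are hypothetical, it assumes 4-byte float and int, and it only models the two offset expressions and the allocation size quoted above.

```cpp
#include <cstddef>

// Hypothetical helpers for illustration; none of these names come from torchaudio.

// Elements skipped by the old expression `block_topk_key + sizeof(float) * beam`
// when block_topk_key is a float*.
constexpr std::size_t old_offset_elems(std::size_t beam) {
    return sizeof(float) * beam;
}

// Elements skipped by the offset the post describes as intended: beam floats.
constexpr std::size_t fixed_offset_elems(std::size_t beam) {
    return beam;
}

// Pointer arithmetic on float* scales an element count by sizeof(float).
constexpr std::size_t float_elems_to_bytes(std::size_t elems) {
    return elems * sizeof(float);
}

// Size of the result region, as quoted from the launch.
constexpr std::size_t result_region_bytes(std::size_t beam) {
    return beam * sizeof(float) + beam * sizeof(int);
}

// Assumption: 4-byte float and int.
static_assert(sizeof(float) == 4 && sizeof(int) == 4,
              "sketch assumes 4-byte float and int");

// Fixed layout: beam float keys followed by beam int values fill the region exactly.
static_assert(float_elems_to_bytes(fixed_offset_elems(8)) + 8 * sizeof(int)
                  == result_region_bytes(8),
              "fixed offset should leave room for exactly beam ints");

// Old layout: the value buffer would begin beyond the size of the whole region.
static_assert(float_elems_to_bytes(old_offset_elems(8)) > result_region_bytes(8),
              "old offset should overshoot the region");
```

For beam = 8 the region is 64 bytes; the intended offset starts the values at byte 32, while the old expression starts them at byte 128.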

PR: Fix CUCTC top-k shared memory offset by lanarkite99 · Pull Request #4198 · pytorch/audio · GitHub

Since the torchaudio repo is no longer actively monitored, I wanted to share the PR here for visibility.

Thanks