Drop row from tensor in CUDA

I’m using Whisper via the transformers library to transcribe some audios. My application needs near-real-time transcription, so I use batching to speed things up. However, Whisper generation is sequential, and some samples in the batch may (and probably will) finish later than others. I don’t want to hold the whole batch while a slow sample is still unfinished. I’m using a Streamer to return completed predictions early, but I’m wondering whether dropping them from the batch for the next predictions would speed up my inference.

For example, let’s say we have a batch of size N. If I detect that sample i has finished (by the end-of-string token), I will return the transcription to the requester, but for the next Whisper model predictions I want to run inference over a batch of size N-1.
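Concretely, I mean something like this (the shape, vocab size, and the index i are just placeholders, and concatenating the two slices is only one possible way to drop the row):

import torch

# hypothetical batch: N = 64 decoder inputs of length 128
input_ids = torch.randint(0, 51865, (64, 128), device='cuda')

i = 10  # index of the sample that just emitted the end-of-string token

# keep the other N-1 rows for the next generation step
input_ids = torch.cat([input_ids[:i], input_ids[i + 1:]], dim=0)
print(input_ids.shape)  # torch.Size([63, 128])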

Is there a fast way I can do this with a CUDA tensor? Do you think it’s worth it?

I don’t fully understand this concept, so could you explain how a single sample can “finish” before the entire batch?

In standard PyTorch layers, the entire batch is processed at once and thus the returned output also contains the results for all samples.
However, it seems your use case processes the samples inside a batch independently?

In Whisper generation there are multiple inference steps, each one predicting the next token of the sentence.

For a batch with 3 samples, it will predict a new token for each sentence at each step. However, one sample in the batch may finish at step i, where its predicted token is the end-of-string token. Whisper will keep this sample in the batch for the next iterations, even though its prediction is already over.
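At each step, the check I have in mind is something like this (eos_token_id, the greedy argmax, and the logits shape are only illustrative):

import torch

eos_token_id = 50257                            # placeholder end-of-string token id
logits = torch.rand(3, 51865, device='cuda')    # batch of 3 at one decoding step

next_tokens = logits.argmax(dim=-1)             # greedy token choice for this step
finished = next_tokens == eos_token_id          # True where a sample just finished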

I’m wondering if I can drop this sample from the batch, so that the next inference step is computed with N-1 samples. It may seem pointless, since removing a single sample should not improve speed much, but if I keep dropping samples continuously, then in a batch of size 64 where half of the samples finish before the others, the last inference steps will be computed over only 32 samples, and that will make a big difference.

That said, in order to do this, I need to modify a torch tensor on the GPU very quickly.

import torch

input_tensor = torch.rand((64, 1024)).to('cuda')
inference_finished = False
while not inference_finished:
    result = predict(input_tensor)                            # one generation step (placeholder)
    input_tensor = drop_concluded_predictions(input_tensor)   # remove rows whose prediction is done (placeholder)
    inference_finished = check_predictions(result)            # stop when every sample has finished (placeholder)

In the ith iteration, input_tensor.shape[0] will be lower than 64.
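If it helps, what I imagine for drop_concluded_predictions is something along these lines (finished_mask would be built from the step’s predictions; this is just a sketch):

import torch

def drop_concluded_predictions(input_tensor, finished_mask):
    # keep only the rows whose prediction is not finished yet;
    # boolean indexing allocates a new, smaller tensor on the same device
    return input_tensor[~finished_mask]

input_tensor = torch.rand(64, 1024, device='cuda')
finished_mask = torch.zeros(64, dtype=torch.bool, device='cuda')
finished_mask[::2] = True                        # pretend half the samples finished

input_tensor = drop_concluded_predictions(input_tensor, finished_mask)
print(input_tensor.shape)                        # torch.Size([32, 1024])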

Is there a way to do this very quickly?

You could profile different approaches: slicing and creating a new tensor from the remaining inputs via torch.cat, or moving the remaining inputs via indexing. Both approaches would trigger a copy, so you could then profile how much performance you gain by reducing the workload vs. the cost of triggering the copies.
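Something along these lines could work as a starting point (the shapes and the keep mask are made up, and CUDA events are used for the timing):

import torch

x = torch.rand(64, 1024, device='cuda')          # made-up batch of inputs
keep = torch.rand(64, device='cuda') > 0.5       # made-up mask of unfinished samples
idx = keep.nonzero(as_tuple=True)[0]             # indices of the rows to keep

def mask_index(x):
    return x[keep]                               # boolean indexing, triggers a copy

def gather_rows(x):
    return x.index_select(0, idx)                # gather the kept rows, triggers a copy

def cat_slices(x):
    return torch.cat([x[i:i + 1] for i in idx.tolist()], dim=0)  # copy via torch.cat

for fn in (mask_index, gather_rows, cat_slices):
    for _ in range(10):                          # warmup
        fn(x)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(100):
        fn(x)
    end.record()
    torch.cuda.synchronize()
    print(fn.__name__, start.elapsed_time(end) / 100, 'ms')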
