How to deploy multiple instances of a PyTorch model API for inference on a single GPU?

I am a beginner in MLOps and I have a Python script that uses a PyTorch model (Whisper Tiny) for speech-to-text (STT). According to the model card, the model has about 39 million parameters, so it is very small compared to my GPU memory (24 GB).
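Back of the envelope, the fp32 weights alone should only take on the order of 150 MB (weights only, ignoring activations and runtime overhead):

params = 39_000_000              # Whisper Tiny parameter count from the model card
print(params * 4 / 1024**3)      # ~0.15 GiB of fp32 weights vs. 24 GiB of GPU memory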

I want to deploy multiple instances of this model on the same GPU and process requests in parallel, so that I can make better use of the GPU memory and improve throughput. However, when I try to do that, the requests are processed sequentially and the GPU utilization stays low.

I am using FastAPI and Docker to build and run my app. I have created a Dockerfile that uses pytorch/pytorch:latest as the base image and runs the app with uvicorn. I have deployed two containers from this image, one on port 8000 and the other on port 8001. When I send two concurrent requests, one to each container, the first request takes 5 seconds and the second takes 10 seconds, which implies that the second request waits for the first to complete.
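Roughly, the concurrency test looks like this (sample.wav is just a placeholder file name):

# Minimal sketch of the concurrency test: fire one request at each container
# at the same time and print how long each takes.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URLS = ["http://localhost:8000/asr", "http://localhost:8001/asr"]

def send(url):
    start = time.time()
    with open("sample.wav", "rb") as f:
        resp = requests.post(url, files={"audio": f})
    return url, time.time() - start, resp.status_code

with ThreadPoolExecutor(max_workers=2) as pool:
    for url, elapsed, status in pool.map(send, URLS):
        print(f"{url}: {elapsed:.1f}s (HTTP {status})")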

Following is how I am running these containers:

docker run -d -p 8000:8000 eng_api
docker run -d -p 8001:8000 eng_api
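(For completeness: with the NVIDIA Container Toolkit, containers can be given access to the GPU with the --gpus flag, for example:)

docker run --gpus all -d -p 8000:8000 eng_api
docker run --gpus all -d -p 8001:8000 eng_api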

Following is my Dockerfile:

FROM pytorch/pytorch:latest
RUN pip install fastapi uvicorn transformers ...
COPY main.py /main.py
WORKDIR /
CMD ["uvicorn", "main:app", "--host=0.0.0.0", "--port=8000"]

Following is my main.py file:

import ...

@app.post("/asr")
async def asr(audio: UploadFile = File(...)):
    # Read the raw bytes of the uploaded file.
    audio_data = await audio.read()

    # Decode and resample the audio to 16 kHz via the datasets Audio feature.
    dset = Dataset.from_dict({"audio": [audio_data]})
    dset = dset.cast_column("audio", Audio(sampling_rate=16000))
    audio_array = dset[0]["audio"]["array"]
    sampling_rate = dset[0]["audio"]["sampling_rate"]

    # Compute log-Mel input features and run Whisper to get the transcription.
    input_features = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt").input_features
    output = model.generate(input_features)
    transcription = processor.batch_decode(output, skip_special_tokens=True)[0]

    return {"transcription": transcription}

How can I resolve this issue? How can I ensure that the containers run in parallel and use the GPU resources efficiently? Is there a way to specify the GPU memory allocation for each container? Or do I need to use a different framework or tool to manage the deployment?

Typically you’re better off serving a single model instance per GPU, unless your GPU supports NVIDIA Multi-Instance GPU (MIG).

The way to increase throughput is instead to increase the batch size, so that a single model instance transcribes several requests per forward pass.
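For example, with the processor and model from the question, batched inference looks roughly like this (the zero-filled arrays below are just placeholders for real 16 kHz audio):

# Sketch of batched Whisper inference: one generate() call for several clips.
import numpy as np

audio_batch = [
    np.zeros(16000, dtype=np.float32),   # placeholder 1-second clip
    np.zeros(16000, dtype=np.float32),   # placeholder 1-second clip
]

inputs = processor(audio_batch, sampling_rate=16000, return_tensors="pt")
generated_ids = model.generate(inputs.input_features)
transcriptions = processor.batch_decode(generated_ids, skip_special_tokens=True)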