Load multiple models for multi-process inference

Here is my code. It runs in parallel, but each inference takes over a thousand times longer.

Without multiprocessing, each inference takes about 20 ms;
with multiprocessing, each inference takes about 25 s.

import time
from multiprocessing import Manager, Process

import torch

def load_model():
    # Build one OCR model from the config file and mark it idle.
    conf_obj = _get_config_obj("tests/ocr/ocr.yml")
    model_manage = OCRModel(conf_obj)
    model_manage.load_model()
    model_manage.status = MODEL_MANAGE_IDLE
    return model_manage


def init_multi_load() -> None:
    # Load three model instances into the shared list created in __main__.
    for i in range(3):
        model_manage_list.append(load_model())


def infer(data: dict, model_manage_list) -> dict:
    # Pick the first idle model, mark it busy, run inference, then release it.
    for model_manage in model_manage_list:
        if model_manage.status == MODEL_MANAGE_IDLE:
            model_manage.status = MODEL_MANAGE_WORKING
            ret = model_manage.run_model(data)
            model_manage.status = MODEL_MANAGE_IDLE
            return ret
    raise ValueError("no idle model for service")


if __name__ == '__main__':
    torch.multiprocessing.set_start_method('spawn', force=True)
    model_manage_list = Manager().list()  # proxy list shared across processes
    init_multi_load()
    data = {}  # placeholder input, just for the showcase here
    for i in range(3):
        proc = Process(target=infer, args=(data, model_manage_list))
        proc.start()

    # Keep the main process alive so the workers can run.
    while True:
        time.sleep(2)

What hardware is running the model? If the processes are, e.g., sharing a single GPU, the increased contention could slow things down.
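
One way to check is to time each worker's forward pass with explicit CUDA synchronization, so that work queued by the other processes shows up in the measured latency. A minimal sketch, assuming a loaded model and an input batch already on the GPU (`timed_inference` is a hypothetical helper, not from this thread):

```python
import time
import torch

def timed_inference(model, batch):
    """Hypothetical helper: time one forward pass in milliseconds,
    synchronizing so pending and in-flight GPU work is included."""
    torch.cuda.synchronize()   # drain work already queued on the GPU
    start = time.perf_counter()
    with torch.no_grad():
        out = model(batch)
    torch.cuda.synchronize()   # wait for this forward pass to finish
    return out, (time.perf_counter() - start) * 1000.0
```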

:pray: I have just a single GPU in my computer.


Is there a way to parallelize on a single GPU? I think my GPU still has a lot of unused capacity.

How similar are the different models on your GPU? If they are similar and have relatively simple building blocks, you might look into, e.g., using batched matmul (torch.bmm — PyTorch 1.13 documentation) in place of linear layers, and grouped convolutions in place of "vanilla" convolutions, to do multiple models' worth of computation at a time per layer. Of course, you would need to be careful about keeping normalization statistics separate across the models.
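
A minimal sketch of that idea, assuming three identical models whose layers are plain linear and convolutional blocks (all names and shapes below are illustrative, not taken from the thread):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Illustrative sizes: three identical "models", each with a 128 -> 64 linear layer.
num_models, in_features, out_features, batch = 3, 128, 64, 16

# Stack one weight/bias per model along a leading "model" dimension.
weights = torch.randn(num_models, in_features, out_features, device=device)
biases = torch.randn(num_models, 1, out_features, device=device)

# One input batch per model: (num_models, batch, in_features).
x = torch.randn(num_models, batch, in_features, device=device)

# A single batched matmul runs all three linear layers in one kernel launch.
y = torch.bmm(x, weights) + biases  # -> (num_models, batch, out_features)

# Convolutional analogue: concatenate the per-model inputs along the channel
# dimension and set groups=num_models so each group uses its own model's filters.
conv = nn.Conv2d(num_models * 8, num_models * 16,
                 kernel_size=3, padding=1, groups=num_models).to(device)
imgs = torch.randn(batch, num_models * 8, 32, 32, device=device)
feat = conv(imgs)  # -> (batch, num_models * 16, 32, 32)
```

Keeping each model's weights in its own slice (or channel group) keeps the parameters separate; only the computation is fused into one kernel per layer.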

Hi, the models are exactly the same, but torch.bmm doesn't work for me. Do you have an example that loads multiple models in one process and runs inference in parallel with subprocesses?