We have built an inference pipeline that takes advantage of multiple GPUs. We launch inference.py with a shell script, passing it, say, a single image, and it returns some results. The code loads a separate model onto each GPU, runs them, and then exits. Now we want to convert this script into a Flask API, so that users can pass images to inference.py from a front-end web UI via a POST request. So we made the following changes to inference.py:
@app.route("/URL", methods=["POST"])
def search_engine():
    if request.method == "POST":
        result = run_multi_GPU_code(request)
        return jsonify(result)

if __name__ == "__main__":
    dist.init_process_group(backend="nccl", init_method="env://")
    app.run(port=8115, host="0.0.0.0", debug=True)
and I get:
File "inference.py", line 55, in <module>
dist.init_process_group(backend="nccl", init_method="env://")
File "/usr/local/lib64/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 500, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/usr/local/lib64/python3.8/site-packages/torch/distributed/rendezvous.py", line 190, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
This typically means the port (I guess 8115) is already in use by a different process. You can find the process using that port by running this command and looking for a line with LISTEN:
Thank you for your response. I have already done this: I changed the port and checked whether torch.distributed uses 8115, and it does not, so the error does not seem to be related to the port. I suspect it is because the Flask API launches a server and torch.distributed also launches a server, and the combination of the two does not work well. Another question: I want to run inference in a distributed fashion, that is, load several instances of a model and pass a segment of the data to each (propagate or map the models and data), then collect the predictions from all models (reduce). Does PyTorch have any tool or API to support this inference pipeline?
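The map/reduce pattern I have in mind could be sketched with torch.distributed collectives. This is only a sketch: it runs as a single process on CPU with the gloo backend standing in for the real multi-GPU NCCL setup, and `run_sharded_inference` and the doubling "model" are placeholder names I made up.

```python
import os
import torch
import torch.distributed as dist

def run_sharded_inference():
    # In real use, torch.distributed.launch / torchrun sets these per rank;
    # they are defaulted here so the sketch runs as a single CPU process.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")
    dist.init_process_group(backend="gloo", init_method="env://")

    rank, world_size = dist.get_rank(), dist.get_world_size()

    # "Map": every rank takes its own shard of the batch.
    full_batch = torch.arange(8, dtype=torch.float32)
    shard = full_batch.chunk(world_size)[rank]

    # Stand-in for the per-rank model forward pass.
    preds = shard * 2

    # "Reduce": collect every rank's predictions on all ranks.
    gathered = [torch.zeros_like(preds) for _ in range(world_size)]
    dist.all_gather(gathered, preds)

    dist.destroy_process_group()
    return torch.cat(gathered).tolist()

if __name__ == "__main__":
    print(run_sharded_inference())
```

With more than one rank, each process would compute only its shard, and `all_gather` would hand every process the full set of predictions.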
The main reason is that when you use torch.distributed.launch to run the model in parallel on 2 devices, Python spawns one process per device, and each process runs every line in the script.
This becomes an issue at app.run(port=8115), where all the processes try to bind the same port to launch their own servers.
Imagine process 0 calls app.run on port 8115 first and succeeds.
Process 1 then tries to use port 8115 as well, but the port is already taken by process 0.
That’s where the RuntimeError: Address already in use comes from.
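The collision itself has nothing to do with Flask or PyTorch specifically; two plain sockets binding the same port reproduce it. A minimal demo (the OS picks the first port, so the test port is guaranteed to be free beforehand):

```python
import socket

def reproduce_port_conflict():
    # First server (think: process 0's app.run) binds successfully.
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    first.bind(("127.0.0.1", 0))      # port 0 = let the OS pick a free port
    port = first.getsockname()[1]
    first.listen()

    # Second server (process 1) tries the very same port.
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        second.bind(("127.0.0.1", port))
        conflict = False
    except OSError:                   # EADDRINUSE: "Address already in use"
        conflict = True
    finally:
        second.close()
        first.close()
    return conflict

if __name__ == "__main__":
    print(reproduce_port_conflict())  # True
```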
I ran into this issue as well. I know where it comes from, but I don’t know how to solve it.
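Two workarounds seem plausible, though I have not tested either in a real multi-rank deployment: give every rank its own port, or let only rank 0 expose the Flask API and forward work to the other ranks via collectives. A sketch (`port_for_rank` and `should_serve` are names I made up):

```python
import os

BASE_PORT = 8115

def port_for_rank(base_port: int, rank: int) -> int:
    # Option A: every rank serves on its own port, so the
    # app.run() calls never collide on one address.
    return base_port + rank

def should_serve(rank: int) -> bool:
    # Option B: only rank 0 exposes the HTTP API; the other ranks
    # skip app.run() entirely and wait for work (e.g. via dist.broadcast).
    return rank == 0

if __name__ == "__main__":
    rank = int(os.environ.get("RANK", "0"))   # set per process by the launcher
    if should_serve(rank):
        # app.run(host="0.0.0.0", port=port_for_rank(BASE_PORT, rank))
        print(f"rank {rank} would serve on port {port_for_rank(BASE_PORT, rank)}")
```

Option B keeps a single public endpoint but requires rank 0 to distribute each incoming request to the other ranks itself; Option A is simpler but needs a load balancer or client-side logic to pick a port.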