RendezvousClosedError in Torchrun Elastic

When I scaled my job down from 2 nodes to 1, the rendezvous failed with a “RendezvousClosedError”. To scale down, I send a signal that stops the worker process on the node I want to remove. Is that a problem? How can I test that elasticity really works without hitting this error?

I am launching torchrun on both machines as follows:
torchrun --nnodes=1:2 --nproc-per-node=1 --max_restarts=3 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=[IP-HOST]:29500

After one training epoch I run “kill -9 [PID]”, where [PID] is the process ID of the worker process on node 2.
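
To make the kill step concrete, here is a minimal Python sketch of what “kill -9” does to a process. It is only an illustration under assumptions: the sleeping child process is a stand-in for the real worker, kill_worker is a hypothetical helper name, and it assumes a POSIX system (SIGKILL is not available on Windows). SIGKILL cannot be caught, so the target process gets no chance to run any shutdown or cleanup code.

```python
import signal
import subprocess
import sys


def kill_worker(proc: subprocess.Popen, sig: int = signal.SIGKILL) -> int:
    """Send `sig` to the process (default: SIGKILL, like `kill -9`) and return its exit code."""
    proc.send_signal(sig)
    # Wait for the process to be reaped so the return code is available.
    return proc.wait(timeout=10)


if __name__ == "__main__":
    # Stand-in for the worker (assumption): a child Python process that just sleeps.
    worker = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    code = kill_worker(worker)
    # On POSIX, a negative return code -N means the child was terminated by signal N.
    print(f"worker exited with return code {code}")
```

Passing signal.SIGTERM instead of the default would be the equivalent of a plain “kill [PID]”, which a process is allowed to handle before exiting.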

Best regards!