"Destroy" TCPStore?

Hi all!

I was wondering if there is a way to destroy a TCPStore, and close sockets associated with it.
I want to destroy the current process group (created with a TCPStore instance), and initialize a new one at the same port, but, as expected, I get an “Address already in use” error.

Thank you!
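
For context, the “Address already in use” error in the question can be reproduced without PyTorch at all. The sketch below uses plain standard-library sockets, not TCPStore, and the helper name `try_bind` is made up for illustration. It shows only the general OS behavior: a port cannot be bound a second time while a listening socket still holds it, and it can be bound again once that socket is closed.

```python
import errno
import socket


def try_bind(port, host="127.0.0.1"):
    """Return a listening socket bound to (host, port), or None if the address is in use."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
        return sock
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            return None
        raise


# Stand-in for the first store: bind to port 0 so the OS picks a free port.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))
first.listen()
port = first.getsockname()[1]

# A second bind on the same port fails while `first` is still open.
second = try_bind(port)

# After the first socket is closed, the same port can be bound again.
first.close()
third = try_bind(port)
rebound_ok = third is not None
if third is not None:
    third.close()
```

So the question comes down to whether the socket held by the store can be closed from Python without ending the process.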

Thanks for posting the question, @fot. Internally, constructing a TCPStore starts a TCPStoreWorkerDaemon that binds the port and does the related setup. When the TCPStore goes out of scope and no references to it remain, its destructor is called automatically and the port is released.
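
A minimal sketch of that lifecycle, assuming a single-process master store on localhost. The `find_free_port` helper and the `"status"` key are made up for illustration. Whether the port is actually freed after the last reference is dropped may depend on the PyTorch version and on whether anything else (for example a process group) still holds the store, so the sketch does not assert that.

```python
import gc
import socket
from datetime import timedelta

import torch.distributed as dist


def find_free_port():
    """Ask the OS for a currently free port (hypothetical helper for this sketch)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


port = find_free_port()

# world_size=1, is_master=True: this process hosts the store and is its only client.
store = dist.TCPStore("127.0.0.1", port, 1, True, timeout=timedelta(seconds=30))

store.set("status", "ready")
value = store.get("status")  # values come back as bytes

# Drop the only reference and force a collection so the destructor can run.
del store
gc.collect()
```

If the store was also passed to `init_process_group`, the process group keeps its own reference, so `del store` alone would not be enough in that case.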

Could you elaborate on how you get the error? If you pressed Ctrl-C on one process, other processes may be left hanging and still hold the TCPStore, since there is no master left to send them the destruction signal. In that case you may need to kill those hanging processes manually. You can find their PIDs with $(ps aux | grep training_script.py | grep -v grep | awk '{print $2}') and then kill them.

BTW, we do have a destroy_process_group(group=None) API if you want to do it programmatically.

Thank you for your answer! Ideally, I would like to destroy the TCPStore and TCPStoreWorkerDaemon without killing the processes that listen on the specific port. I was wondering if this is possible from the Python API. “destroy_process_group(group=None)” does not do exactly what I want since, as far as I understand, it does not remove the TCPStore.

Yeah, I just filed an issue about this. We don’t currently have a destructor or API that you can call to release those ports. Tracking it here: [c10d] destruction of Store objects · Issue #72025 · pytorch/pytorch · GitHub

Ok, great, thank you!