FSDP with size_based_auto_wrap_policy freezes training

I am doing multi-node and multi-GPU training. Right now, I am optimizing my multi-GPU code for faster training.

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

torch.cuda.set_device(self.rank)
model = FSDP(model, use_orig_params=True, device_id=torch.cuda.current_device())
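
For context, a minimal sketch of how the process group is initialized around those lines (setup_distributed is a placeholder name, and the env-based rendezvous stands in for my actual launch code):

import os
import torch
import torch.distributed as dist

def setup_distributed(rank, world_size):
    # NCCL backend for GPU collectives; MASTER_ADDR / MASTER_PORT are
    # provided by the launcher (hence the default port 29500 in the error below)
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)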

Basically, everything works fine, but if I wrap the model using the size_based_auto_wrap_policy, training starts and then stops exactly after 4 epochs and does not continue. Both GPUs (2x A40) appear to be at 100% utilization in nvidia-smi.

import functools
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

my_auto_wrap_policy = functools.partial(
    size_based_auto_wrap_policy, min_num_params=100
)
model = FSDP(model, use_orig_params=True, device_id=torch.cuda.current_device(),
             auto_wrap_policy=my_auto_wrap_policy)
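
For illustration, here is a minimal sketch of how I inspect what the policy actually wraps, using a toy nn.Sequential instead of my real model (the rank-0 print is just for debugging and assumes the process group is already initialized):

import functools
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# toy model purely for illustration, not my actual architecture
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).cuda()

my_auto_wrap_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=100)
model = FSDP(model, use_orig_params=True,
             device_id=torch.cuda.current_device(),
             auto_wrap_policy=my_auto_wrap_policy)

if dist.get_rank() == 0:
    print(model)  # shows which submodules got their own FSDP unit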

Moreover, another problem with this wrap policy is that when I kill the script with CTRL+C, the processes remain alive in the background; if I then try to run the script again, I get this error:

[W socket.cpp:401] [c10d] The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use).
[W socket.cpp:401] [c10d] The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
[E socket.cpp:435] [c10d] The server socket has failed to listen on any local network address.
...
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:12345 (errno: 98 - Address already in use). The server socket has failed to bind to 12345 (errno: 98 - Address already in use).
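
I suspect one of the ranks never exits and keeps the rendezvous port bound. A minimal sketch of the cleanup I am considering (the handler name _graceful_shutdown is mine, and I have not verified that this actually fixes the leftover processes):

import signal
import sys
import torch.distributed as dist

def _graceful_shutdown(signum, frame):
    # tear down the process group so the rendezvous port is released
    if dist.is_initialized():
        dist.destroy_process_group()
    sys.exit(0)

signal.signal(signal.SIGINT, _graceful_shutdown)   # Ctrl+C
signal.signal(signal.SIGTERM, _graceful_shutdown)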