In FSDP2, a module’s post-backward hook is registered through an autograd function that runs the hook once the gradients of the module’s inputs have been computed.
But it seems like the hook assumes that, at the moment it runs, the gradients of the module’s weights have also been computed. Is it actually guaranteed that the weight gradients are ready by the time the input gradients are? If so, how?
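To make the question concrete, here is a minimal sketch of the registration pattern I mean (this is my own simplified reproduction, not FSDP2’s actual code): a custom `torch.autograd.Function` passes the inputs through in `forward`, and its `backward` fires a hook once the engine has produced gradients w.r.t. those inputs. Inside the hook I check whether `weight.grad` has been accumulated yet.

```python
import torch

# Simplified sketch (not FSDP2's actual implementation): an autograd
# Function whose backward runs a hook once the gradients of the wrapped
# input tensors have been computed.
class RunHookOnInputGrads(torch.autograd.Function):
    @staticmethod
    def forward(ctx, hook, *inputs):
        ctx.hook = hook
        return inputs  # pass the inputs through unchanged

    @staticmethod
    def backward(ctx, *grad_inputs):
        # The engine has produced gradients w.r.t. the *inputs* here;
        # whether the parameter gradients are also ready at this point
        # is exactly the question.
        ctx.hook()
        return (None,) + grad_inputs  # no gradient for the hook argument


linear = torch.nn.Linear(4, 4)
observed = {}

def hook():
    # Record whether weight.grad has already been accumulated
    # at the moment the input gradients become available.
    observed["weight_grad_ready"] = linear.weight.grad is not None

x0 = torch.randn(2, 4, requires_grad=True)
(x1,) = RunHookOnInputGrads.apply(hook, x0)
linear(x1).sum().backward()
print(observed)
```

Note that no expected value is asserted for `observed["weight_grad_ready"]`: whether it is `True` here is precisely what I am unsure about.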
I would assume all gradients have been computed and are ready if you are using a post-backward hook, since the hook is called after the backward call. Let me know if I misunderstood your question.
Well, the hook is registered on the input tensors, not on the weights, so it’s not clear to me that the weight gradients will also have been computed and accumulated by that point.
I don’t know much about the autograd engine internals, but it seems possible for it to be in a state where one part of the backward graph has already been executed (the part containing the input tensors’ nodes, in this case) while another part (the part containing the weight nodes) has not been executed yet.