How to save model state in PyTorch FSDP

qibin0506 · December 24, 2024, 1:59pm

I am using PyTorch’s FSDP. How can I save the complete model state every 100 batches? Is the following code correct?

if (batch + 1) % 100 == 0:
    if isinstance(model, FSDP):
        states = model.state_dict()
        if is_main_process:
            ckpt = {'model_state_dict': states}
            torch.save(ckpt, 'ckpt.pth')
    else:
        ckpt = {'model_state_dict': model.state_dict()}
        torch.save(ckpt, 'ckpt.pth')

Or should I use method 2?

with FSDP.summon_full_params(
        module=model,
        rank0_only=True,
        writeback=False,
        offload_to_cpu=True
):
    states = model.state_dict()
    ckpt = {'model_state_dict': states}
    torch.save(ckpt, 'ckpt.pth')

H-Huang · December 26, 2024, 11:05pm

Distributed Checkpoint (DCP) is the recommended utility for saving models parallelized with FSDP, TP, etc.

https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html#how-to-use-dcp
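
For the "every 100 batches" case in the question, a minimal sketch of a DCP-based save might look like the following. The helper name `save_every_n` and its parameters are hypothetical, not part of PyTorch; the sketch assumes a PyTorch version that provides `dcp.save` and `get_model_state_dict` (2.2 or later). With FSDP, every rank is expected to call it, since each rank writes its own shards.

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_model_state_dict


def save_every_n(model, batch, checkpoint_dir, every=100):
    """Hypothetical helper: write a DCP checkpoint every `every` batches.

    Call this on all ranks. No `is_main_process` guard is used here,
    because DCP coordinates the ranks and each one saves its own shards.
    Returns True if a checkpoint was written for this batch.
    """
    if (batch + 1) % every != 0:
        return False

    # get_model_state_dict accepts FSDP-wrapped and plain modules alike.
    state_dict = {"model": get_model_state_dict(model)}

    # checkpoint_id is a directory; DCP writes shard files and metadata into it.
    dcp.save(state_dict, checkpoint_id=checkpoint_dir)
    return True
```

Inside the training loop this would be called as `save_every_n(model, batch, "checkpoint")` after each optimizer step. Unlike `torch.save`, the result is a directory of files rather than a single `ckpt.pth`.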

qibin0506 · December 27, 2024, 7:38am

After reviewing the document, I still have some questions.
Why do I need to use both dcp.load and model.load_state_dict if I want to load all parameters into a non-FSDP model?
Will dcp.load take effect in this case? Does model.load_state_dict load all the parameters?

def run_checkpoint_load_example():
    # create the non FSDP-wrapped toy model
    model = ToyModel()
    state_dict = {
        "model": model.state_dict(),
    }

    # since no process group is initialized, DCP will disable any collectives.
    dcp.load(
        state_dict=state_dict,
        checkpoint_id=CHECKPOINT_DIR,
    )
    model.load_state_dict(state_dict["model"])

https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html#how-to-use-dcp