Hi experts,
I want to train models with FSDP2, but I found a mismatch in the updated parameters compared to the non-FSDP model. Here is an example:
import copy

import torch
import torch.nn as nn
from torch.distributed.fsdp import fully_shard  # FSDP2 API

torch.manual_seed(1)
model = nn.Sequential(
    nn.Sequential(nn.Linear(128, 128), nn.Linear(128, 128)),
    nn.Sequential(nn.Linear(128, 128), nn.Linear(128, 128)),
    nn.Linear(128, 128),
).cuda()
inputs = torch.randn(2, 2, 128).cuda()

# make a copy to be wrapped with FSDP
fsdp_model = copy.deepcopy(model)
out_ref = model(inputs)
optimizer_ref = torch.optim.SGD(model.parameters(), lr=0.1)

# apply FSDP
fully_shard(fsdp_model[0])
fully_shard(fsdp_model[1])
fully_shard(fsdp_model)
out_test = fsdp_model(inputs)
optimizer_test = torch.optim.SGD(fsdp_model.parameters(), lr=0.1)

# This passes: the forward outputs match
assert torch.allclose(out_ref, out_test)

out_ref.sum().backward()
out_test.sum().backward()

# This passes: all parameters and gradients are identical
for (n1, p1), (n2, p2) in zip(model.named_parameters(), fsdp_model.named_parameters()):
    assert torch.allclose(p1, p2.full_tensor())
    assert torch.allclose(p1.grad, p2.grad.full_tensor())

optimizer_ref.step()
optimizer_ref.zero_grad()
optimizer_test.step()
optimizer_test.zero_grad()

# This fails: p1 != p2 for the last Linear layer in the model
for (n1, p1), (n2, p2) in zip(model.named_parameters(), fsdp_model.named_parameters()):
    assert torch.allclose(p1, p2.full_tensor())

# Second forward pass with the same inputs
out_ref_1 = model(inputs)
out_test_1 = fsdp_model(inputs)
# This fails because the last layer's weights differ
assert torch.allclose(out_ref_1, out_test_1)
torch.multiprocessing is used to run the code. I expect the weights of the FSDP model to be identical to the non-FSDP model's after the optimizer step, but the last linear layer differs, which I can't understand since the gradients are identical.
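For context, the snippet is launched with torch.multiprocessing roughly as below. This is only a minimal sketch of the setup; `run_repro` is a hypothetical helper wrapping the code above, and the backend, port, and world size are illustrative assumptions rather than the exact script:

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int):
    # Placeholder rendezvous settings; adjust to your environment.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    run_repro()  # hypothetical helper containing the snippet above
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)
```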
If I fully_shard the last linear layer as well, the test passes (sketched below), but in LLM-style models there is a linear LM head outside the decoder blocks, and from what I see in torchtitan, calling fully_shard on every submodule is not the expected usage.
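For reference, this is the workaround I mean; only the sharding calls change relative to the snippet above, and the index `[2]` simply refers to the last Linear in this toy model:

```python
# Variant of the sharding calls that makes the post-step comparison pass:
# the last Linear (analogous to an LM head) gets its own fully_shard group.
fully_shard(fsdp_model[0])
fully_shard(fsdp_model[1])
fully_shard(fsdp_model[2])  # additionally shard the last Linear layer
fully_shard(fsdp_model)     # root wrap
```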
Any help is appreciated, thank you!