Hello, I'm using FSDP with SFTTrainer on 2 GPUs to fine-tune Qwen2.5-0.5B, and I ran into a question about the resulting shard sizes.
My code is:
import functools

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM
from transformers.models.qwen2.modeling_qwen2 import Qwen2DecoderLayer

model = AutoModelForCausalLM.from_pretrained(
    model_id,  # e.g. "Qwen/Qwen2.5-0.5B"
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)

# Wrap each Qwen2 decoder layer as its own FSDP unit.
auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={Qwen2DecoderLayer},
)

model = FSDP(
    model,
    auto_wrap_policy=auto_wrap_policy,
    device_id=torch.cuda.current_device(),
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # ZeRO-3
    use_orig_params=True,
)
Why does FSDP give a different shard size on each rank?
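
For reference, this is roughly how I compare the per-rank shard sizes (a minimal sketch; it assumes the script is launched with torchrun so the process group is already initialized, and that with FULL_SHARD + use_orig_params=True the parameters seen outside forward/backward are views into each rank's local shard):

import torch.distributed as dist

# Count the parameter elements actually materialized on this rank.
# As far as I understand, with use_orig_params=True the parameters not owned
# by this rank appear empty (numel() == 0) outside of forward/backward, so
# this sum approximates the size of the local shard (excluding padding).
local_numel = sum(p.numel() for p in model.parameters())
print(f"rank {dist.get_rank()}: {local_numel / 1e6:.2f}M local parameter elements")

The two ranks print noticeably different totals, which is what prompted the question above.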
