No gradient update when using FSDP with Hugging Face Accelerate

I am training a decoder LLM (GPT-2 family). I converted the code to use Accelerate following the Hugging Face instructions.

When the Accelerate config does not use FSDP, training works fine. But when I enable FSDP in the config, no gradient update happens and the model stays static. I am providing both the non-FSDP and FSDP configs below.
I have looked through the code and searched blogs and issues, but I cannot find the cause.
What could be the problem? Any suggestions?

Environment:
Python==3.9
transformers==4.28.1
accelerate==0.18.0
torch==1.13.1+cu117 (installed with --extra-index-url https://download.pytorch.org/whl/cu117)

I assume the FSDP-enabled code should also work on a single GPU.

Non-FSDP config
compute_environment: LOCAL_MACHINE
distributed_type: 'NO'
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

FSDP config

compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_offload_params: false
  fsdp_sharding_strategy: 2
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: CodeGenBlock
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
main_process_ip: x.x.x.x
main_process_port: 53254
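
Both configs were generated with accelerate config and are passed to the launcher; assuming a file name such as fsdp_config.yaml (the file and script names here are just placeholders), the run is started with:

accelerate launch --config_file fsdp_config.yaml train.py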

The key steps in the Accelerate migration are:
from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

from torch.distributed.fsdp import (
FullyShardedDataParallel as FSDP,
StateDictType,
FullStateDictConfig
)
from accelerate import FullyShardedDataParallelPlugin
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig
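
For completeness, the plugin-related imports above are only needed if the FSDP settings are built in code rather than through accelerate config. A minimal sketch of that alternative, assuming the accelerate 0.18 field names (they may differ between versions); the plugin can also be passed together with the other Accelerator kwargs shown below:

# build the FSDP plugin programmatically instead of via the YAML config
fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=True),
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)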


initialize

launch_timeout = timedelta(seconds=config["ACCELERATOR"]["torch_launcher_to"])
accelerator_log_kwargs = {
        'kwargs_handlers':[InitProcessGroupKwargs(timeout=launch_timeout)],
        'gradient_accumulation_steps': config["ACCELERATOR"]['gradient_accumulation_steps'],
        }
accelerator_log_kwargs["log_with"] = "tensorboard"
accelerator_log_kwargs["project_dir"] = dirs["logs_dir"]

## Initialize the accelerator
accelerator = Accelerator(**accelerator_log_kwargs)

model = accelerator.prepare(model)  # added for accelerator
train_dataloader, optimizer, lr_scheduler, device = accelerator.prepare(train_dataloader, optimizer, lr_scheduler, device)  # added for accelerator
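
One thing I am not sure about is the ordering: the Accelerate FSDP docs say the model should be prepared before the optimizer is created (or everything passed to a single prepare call), since FSDP flattens the parameters and an optimizer built on the original, unwrapped parameters would not track the sharded ones. For reference, that ordering would look like this sketch (the optimizer class and learning rate are placeholders):

from torch.optim import AdamW

model = accelerator.prepare(model)
# build the optimizer from the FSDP-wrapped model's parameters
optimizer = AdamW(model.parameters(), lr=1e-5)
train_dataloader, optimizer, lr_scheduler = accelerator.prepare(train_dataloader, optimizer, lr_scheduler)
# the target device is available as accelerator.device and does not need prepare()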


    loss = outputs.loss
    train_variables_dict['total_loss'] += loss.detach().float()
    optimizer.zero_grad()
    # loss.backward()
    # modified for accelerator
    accelerator.backward(loss) 
    optimizer.step()
    real_lr = optimizer.param_groups[0]['lr']
    lr_scheduler.step()
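
By "static" I mean the weights do not change between optimizer steps. A simplified sketch of the kind of check used to confirm this (placeholder names, not my exact code):

with torch.no_grad():
    before = {name: p.detach().clone() for name, p in model.named_parameters() if p.requires_grad}

# ... run one forward/backward/optimizer step as in the loop above ...

with torch.no_grad():
    changed = any(
        not torch.equal(before[name], p.detach())
        for name, p in model.named_parameters()
        if name in before
    )
print(f"parameters changed after step: {changed}")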

Could you update PyTorch to the latest stable release (or, even better, the latest nightly) and check whether you still see the same behavior?