I am training a decoder-only LLM (GPT-2 family). I converted the code to use Accelerate, following the Hugging Face instructions.
When the accelerate config does not use FSDP, training works fine. But when I enable FSDP in the config, no gradient updates happen and the model stays static. Both the non-FSDP and FSDP configs are provided below.
I have gone through the code and searched blogs and issues, but I cannot find the cause.
What could be the problem? Any suggestions?
Environment:
Python== 3.9
transformers==4.28.1
accelerate==0.18.0
torch==1.13.1+cu117 (installed with --extra-index-url https://download.pytorch.org/whl/cu117)
I assume the FSDP-enabled code should also work on a single GPU.
Non-FSDP config
compute_environment: LOCAL_MACHINE
distributed_type: 'NO'
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
FSDP config
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_offload_params: false
  fsdp_sharding_strategy: 2
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: CodeGenBlock
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
main_process_ip: x.x.x.x
main_process_port: 53254
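Both runs are launched the same way and only the config file changes; the file names here are placeholders, not my actual paths:

accelerate launch --config_file fsdp_config.yaml train.py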
The key changes made for the Accelerate migration are:
from datetime import timedelta

from accelerate import Accelerator, FullyShardedDataParallelPlugin
from accelerate.utils import InitProcessGroupKwargs
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig
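For context, these plugin imports come from the Hugging Face example pattern, where the state-dict configs are passed to the Accelerator through an FSDP plugin. A rough sketch of that pattern (not my exact code; keyword names may differ slightly across accelerate versions):

fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=True),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=True),
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)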
Initialization:
launch_timeout = timedelta(seconds=config["ACCELERATOR"]["torch_launcher_to"])
accelerator_log_kwargs = {
    'kwargs_handlers': [InitProcessGroupKwargs(timeout=launch_timeout)],
    'gradient_accumulation_steps': config["ACCELERATOR"]['gradient_accumulation_steps'],
}
accelerator_log_kwargs["log_with"] = "tensorboard"
accelerator_log_kwargs["project_dir"] = dirs["logs_dir"]

## Initialize the accelerator
accelerator = Accelerator(**accelerator_log_kwargs)
model = accelerator.prepare(model)  ### added for accelerator
train_dataloader, optimizer, lr_scheduler, device = accelerator.prepare(train_dataloader, optimizer, lr_scheduler, device)  ### added for accelerator
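For reference, the Accelerate examples usually prepare the model, optimizer, dataloader, and scheduler in a single call and take the device from accelerator.device rather than passing it to prepare. A minimal sketch of that pattern, using the object names from my code:

model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, lr_scheduler
)
device = accelerator.device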
loss = outputs.loss
train_variables_dict['total_loss'] += loss.detach().float()
optimizer.zero_grad()
# loss.backward()
# modified for accelerator
accelerator.backward(loss)
optimizer.step()
real_lr = optimizer.param_groups[0]['lr']
lr_scheduler.step()
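A simple way to confirm that the weights really are static is to snapshot the parameters before one step and compare them afterwards; this runs inside the training script above, where torch and the prepared model are already in scope (illustrative check, not part of my actual loop):

params_before = [p.detach().clone() for p in model.parameters()]
# ... run one forward/backward pass and optimizer.step() as in the loop above ...
params_after = [p.detach() for p in model.parameters()]
changed = any(not torch.equal(b, a) for b, a in zip(params_before, params_after))
print(f"parameters changed after step: {changed}")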