Hi, I'm trying to SFT LoRA-tune the Llama 3.2 1B Instruct model and I'm running into issues with DDP.
With exactly the same training args, training works perfectly in a single-GPU environment, but it gets stuck every time I run it in a multi-GPU environment.
Below are my settings and the logs.
[2024-12-03 13:08:22,083] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Warning: The cache directory for DeepSpeed Triton autotune, /home/dj475/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
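(For reference, the Triton warning above can be addressed by pointing `TRITON_CACHE_DIR` at a non-NFS path before DeepSpeed/Triton is imported. A minimal sketch in Python; the path is just an example, and in practice the variable can simply be exported in the shell before launching:)

```python
import os

# Point the Triton autotune cache at local storage instead of NFS,
# as the DeepSpeed warning suggests. The path below is only an example.
# This must run before `deepspeed` (and therefore Triton) is imported.
os.environ.setdefault("TRITON_CACHE_DIR", "/tmp/triton_autotune")
os.makedirs(os.environ["TRITON_CACHE_DIR"], exist_ok=True)
```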
llamafactory version: 0.9.2.dev0
Platform: Linux-5.4.0-100-generic-x86_64-with-glibc2.31
Python version: 3.10.15
PyTorch version: 2.5.1+cu124 (GPU)
Transformers version: 4.46.1
Datasets version: 3.1.0
Accelerate version: 1.0.1
PEFT version: 0.12.0
TRL version: 0.9.6
GPU type: NVIDIA RTX A6000
DeepSpeed version: 0.15.4
vLLM version: 0.6.4.post1
top.booster: auto
top.checkpoint_path: []
top.finetuning_type: lora
top.model_name: Llama-3.2-1B-Instruct
top.quantization_bit: none
top.quantization_method: bitsandbytes
top.rope_scaling: none
top.template: llama3
train.additional_target: ''
train.badam_mode: layer
train.badam_switch_interval: 50
train.badam_switch_mode: ascending
train.badam_update_ratio: 0.05
train.batch_size: 4
train.compute_type: fp16
train.create_new_adapter: false
train.cutoff_len: 131072
train.dataset:
- radiology_dataset
train.dataset_dir: data
train.ds_offload: false
train.ds_stage: none
train.extra_args: '{"optim": "adamw_torch", "ddp_find_unused_parameters": false}'
train.freeze_extra_modules: ''
train.freeze_trainable_layers: 2
train.freeze_trainable_modules: all
train.galore_rank: 16
train.galore_scale: 0.25
train.galore_target: all
train.galore_update_interval: 200
train.gradient_accumulation_steps: 4
train.learning_rate: 5e-5
train.logging_steps: 5
train.lora_alpha: 16
train.lora_dropout: 0
train.lora_rank: 8
train.lora_target: ''
train.loraplus_lr_ratio: 0
train.lr_scheduler_type: cosine
train.mask_history: false
train.max_grad_norm: '1.0'
train.max_samples: '100000'
train.neat_packing: false
train.neftune_alpha: 0
train.num_train_epochs: '3.0'
train.packing: false
train.ppo_score_norm: false
train.ppo_whiten_rewards: false
train.pref_beta: 0.1
train.pref_ftx: 0
train.pref_loss: sigmoid
train.report_to: false
train.resize_vocab: false
train.reward_model: null
train.save_steps: 100
train.shift_attn: false
train.train_on_prompt: false
train.training_stage: Supervised Fine-Tuning
train.use_badam: false
train.use_dora: false
train.use_galore: false
train.use_llama_pro: false
train.use_pissa: false
train.use_rslora: false
train.val_size: 0.15
train.warmup_steps: 0
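For clarity, here is roughly what the relevant settings above correspond to in plain Hugging Face `TrainingArguments`. This is only a sketch (the `output_dir` is a placeholder, and LLaMA-Factory builds the real arguments internally), but it shows where the `ddp_find_unused_parameters` setting from `train.extra_args` and the fp16 compute type end up:

```python
from transformers import TrainingArguments

# Rough equivalent of the web UI settings above; a sketch, not the exact
# arguments LLaMA-Factory constructs internally.
args = TrainingArguments(
    output_dir="outputs/llama32-1b-lora",  # placeholder
    per_device_train_batch_size=4,         # train.batch_size
    gradient_accumulation_steps=4,         # train.gradient_accumulation_steps
    learning_rate=5e-5,                    # train.learning_rate
    lr_scheduler_type="cosine",            # train.lr_scheduler_type
    num_train_epochs=3.0,                  # train.num_train_epochs
    max_grad_norm=1.0,                     # train.max_grad_norm
    warmup_steps=0,                        # train.warmup_steps
    logging_steps=5,                       # train.logging_steps
    save_steps=100,                        # train.save_steps
    fp16=True,                             # train.compute_type: fp16
    optim="adamw_torch",                   # from train.extra_args
    ddp_find_unused_parameters=False,      # required for LoRA + DDP
)
```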
Log from the multi-GPU run (it hangs here with no error message):
[INFO|2024-12-03 13:05:51] parser.py:355 >> Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, compute dtype: torch.float16
[WARNING|2024-12-03 13:05:51] logging.py:162 >> ddp_find_unused_parameters needs to be set as False for LoRA in DDP training.
[INFO|2024-12-03 13:05:51] parser.py:355 >> Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.float16
[INFO|2024-12-03 13:05:51] configuration_utils.py:679 >> loading configuration file config.json from cache at /home/dj475/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct/snapshots/9213176726f574b556790deb65791e0c5aa438b6/config.json
[INFO|2024-12-03 13:05:51] configuration_utils.py:746 >> Model config LlamaConfig { "_name_or_path": "meta-llama/Llama-3.2-1B-Instruct", "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 128000, "eos_token_id": [ 128001, 128008, 128009 ], "head_dim": 64, "hidden_act": "silu", "hidden_size": 2048, "initializer_range": 0.02, "intermediate_size": 8192, "max_position_embeddings": 131072, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 16, "num_key_value_heads": 8, "pretraining_tp": 1, "rms_norm_eps": 1e-05, "rope_scaling": { "factor": 32.0, "high_freq_factor": 4.0, "low_freq_factor": 1.0, "original_max_position_embeddings": 8192, "rope_type": "llama3" }, "rope_theta": 500000.0, "tie_word_embeddings": true, "torch_dtype": "bfloat16", "transformers_version": "4.46.1", "use_cache": true, "vocab_size": 128256 }
[INFO|2024-12-03 13:05:51] parser.py:355 >> Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: torch.float16
[INFO|2024-12-03 13:05:51] parser.py:355 >> Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, compute dtype: torch.float16
[INFO|2024-12-03 13:05:51] tokenization_utils_base.py:2211 >> loading file tokenizer.json from cache at /home/dj475/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct/snapshots/9213176726f574b556790deb65791e0c5aa438b6/tokenizer.json
[INFO|2024-12-03 13:05:51] tokenization_utils_base.py:2211 >> loading file tokenizer.model from cache at None
[INFO|2024-12-03 13:05:51] tokenization_utils_base.py:2211 >> loading file added_tokens.json from cache at None
[INFO|2024-12-03 13:05:51] tokenization_utils_base.py:2211 >> loading file special_tokens_map.json from cache at /home/dj475/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct/snapshots/9213176726f574b556790deb65791e0c5aa438b6/special_tokens_map.json
[INFO|2024-12-03 13:05:51] tokenization_utils_base.py:2211 >> loading file tokenizer_config.json from cache at /home/dj475/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct/snapshots/9213176726f574b556790deb65791e0c5aa438b6/tokenizer_config.json
[INFO|2024-12-03 13:05:51] tokenization_utils_base.py:2475 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|2024-12-03 13:05:52] configuration_utils.py:679 >> loading configuration file config.json from cache at /home/dj475/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct/snapshots/9213176726f574b556790deb65791e0c5aa438b6/config.json
[INFO|2024-12-03 13:05:52] configuration_utils.py:746 >> Model config LlamaConfig { "_name_or_path": "meta-llama/Llama-3.2-1B-Instruct", "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 128000, "eos_token_id": [ 128001, 128008, 128009 ], "head_dim": 64, "hidden_act": "silu", "hidden_size": 2048, "initializer_range": 0.02, "intermediate_size": 8192, "max_position_embeddings": 131072, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 16, "num_key_value_heads": 8, "pretraining_tp": 1, "rms_norm_eps": 1e-05, "rope_scaling": { "factor": 32.0, "high_freq_factor": 4.0, "low_freq_factor": 1.0, "original_max_position_embeddings": 8192, "rope_type": "llama3" }, "rope_theta": 500000.0, "tie_word_embeddings": true, "torch_dtype": "bfloat16", "transformers_version": "4.46.1", "use_cache": true, "vocab_size": 128256 }
[INFO|2024-12-03 13:05:52] tokenization_utils_base.py:2211 >> loading file tokenizer.json from cache at /home/dj475/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct/snapshots/9213176726f574b556790deb65791e0c5aa438b6/tokenizer.json
[INFO|2024-12-03 13:05:52] tokenization_utils_base.py:2211 >> loading file tokenizer.model from cache at None
[INFO|2024-12-03 13:05:52] tokenization_utils_base.py:2211 >> loading file added_tokens.json from cache at None
[INFO|2024-12-03 13:05:52] tokenization_utils_base.py:2211 >> loading file special_tokens_map.json from cache at /home/dj475/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct/snapshots/9213176726f574b556790deb65791e0c5aa438b6/special_tokens_map.json
[INFO|2024-12-03 13:05:52] tokenization_utils_base.py:2211 >> loading file tokenizer_config.json from cache at /home/dj475/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct/snapshots/9213176726f574b556790deb65791e0c5aa438b6/tokenizer_config.json
[INFO|2024-12-03 13:05:52] tokenization_utils_base.py:2475 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|2024-12-03 13:05:52] logging.py:157 >> Replace eos token: <|eot_id|>
[INFO|2024-12-03 13:05:52] logging.py:157 >> Add pad token: <|eot_id|>
[INFO|2024-12-03 13:05:52] logging.py:157 >> Loading dataset radiology_sft_instruct.json...
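The output stops at the dataset-loading line above and the processes just hang. In case it helps to narrow things down, below is a minimal `torch.distributed` smoke test (independent of LLaMA-Factory) that checks whether basic NCCL communication across the four GPUs works at all; the filename and launch command are just examples:

```python
# Minimal NCCL smoke test, launched with:
#   torchrun --nproc_per_node=4 ddp_smoke_test.py
# If this also hangs, the problem is likely in the NCCL/multi-GPU setup
# rather than in the LoRA training configuration.
import os

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    # Each rank contributes its global rank; after all_reduce every rank
    # should print the same sum (0 + 1 + 2 + 3 = 6 with four GPUs).
    t = torch.tensor([float(dist.get_rank())], device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: all_reduce result = {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```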
Thanks!