PyTorch: No Output from Fine-Tuning

Hello Everyone,

I am trying to fine-tune Llama 3 8B on a custom dataset in JSON format using torchtune. When I execute the training command, the log file stays empty, nothing is generated in the output directories, and no error is reported either.

Please find below the config I am using and the command I run. The JSON dataset follows the Alpaca format.
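
For reference, each record in my dataset uses the standard Alpaca schema of instruction/input/output fields (the values below are illustrative, not my actual data):

```json
[
  {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "PyTorch is an open source machine learning framework ...",
    "output": "PyTorch is an open source framework for machine learning."
  },
  {
    "instruction": "What is the capital of France?",
    "input": "",
    "output": "The capital of France is Paris."
  }
]
```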

Can you please help me figure out where I am going wrong? This is the command:

```
tune run lora_finetune_single_device --config torchtune_llama3.yaml
```

And this is my full config, torchtune_llama3.yaml:

```yaml
# Config for single device LoRA finetuning in lora_finetune_single_device.py
# using a Llama3 8B model
#
# This config assumes that you've run the following command before launching
# this run:
#   tune download meta-llama/Meta-Llama-3-8B --output-dir /tmp/Meta-Llama-3-8B --hf-token <HF_TOKEN>
#
# To launch on a single device, run the following command from root:
#   tune run lora_finetune_single_device --config llama3/8B_lora_single_device
#
# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
#   tune run lora_finetune_single_device --config llama3/8B_lora_single_device checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
#
# This config works only for training on single device.

# Model Arguments
model:
  _component_: torchtune.models.llama3.lora_llama3_8b
  lora_attn_modules: ['q_proj', 'v_proj']
  apply_lora_to_mlp: False
  apply_lora_to_output: False
  lora_rank: 8
  lora_alpha: 16

# Tokenizer
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: C:/GIT_External/Meta-Llama-3-8B/8B/tokenizer.model

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: C:/GIT_External/Meta-Llama-3-8B/8B/
  checkpoint_files: [
    consolidated.00.pth
  ]
  recipe_checkpoint: null
  output_dir: C:/GIT_External/Meta-Llama-3-8B/8B/
  model_type: LLAMA3
resume_from_checkpoint: False

# Dataset and Sampler
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  train_on_input: True
  format: json
  data_path: C:/Products/GenAI/Training/testtrain.json
seed: null
shuffle: True
batch_size: 2

# Optimizer and Scheduler
optimizer:
  _component_: torch.optim.AdamW
  weight_decay: 0.01
  lr: 3e-4
lr_scheduler:
  _component_: torchtune.modules.get_cosine_schedule_with_warmup
  num_warmup_steps: 100

loss:
  _component_: torch.nn.CrossEntropyLoss

# Training
epochs: 1
max_steps_per_epoch: null
gradient_accumulation_steps: 64
compile: False

# Logging
output_dir: C:/Products/GenAI/Training/Logs/
metric_logger:
  _component_: torchtune.utils.metric_logging.DiskLogger
  log_dir: ${output_dir}
log_every_n_steps: 1

# Environment
device: cpu
dtype: fp32
enable_activation_checkpointing: True

# Profiler (disabled)
profiler:
  _component_: torchtune.utils.profiler
  enabled: False
```
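
In case it helps narrow things down, my understanding from the torchtune docs is that a config can be sanity-checked before running; something along these lines should confirm that the file parses and the `_component_` paths resolve (I'm assuming the `tune validate` subcommand accepts a custom config path like this):

```
tune validate torchtune_llama3.yaml
```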

@ptrblck Could you please take a look and point out what might be causing this issue?