Checkpoints containing only rng_state.pth

Hi guys,
I'm trying to train a model with Hugging Face (PyTorch backend), and I have a problem I was hoping to get some help with here (there was no answer on the HF forum, unfortunately). After a certain number of steps (~8000), my checkpoints only contain an rng_state.pth file:

[Screenshot: checkpoint folders containing only rng_state.pth]

Why could this be the case? I sync my checkpoints automatically so that I can resume training if Colab crashes. Unfortunately, out of 22 checkpoints (1k-23k), only the first 6 (1k-7k) are valid, i.e. contain all the required files.
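For reference, the resume step looks roughly like this (a sketch, assuming a Trainer instance named trainer has already been built; the checkpoint path is illustrative, and resume_from_checkpoint=True would instead pick the latest checkpoint in output_dir):

# Resume from the last synced checkpoint after a Colab crash.
# Assumes `trainer` is an existing transformers.Trainer instance.
trainer.train(resume_from_checkpoint="/share/datasets/output_run/checkpoint-7000")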

TrainingArguments(
    output_dir=/share/datasets/output_run, 
    overwrite_output_dir=True, 
    do_train=True, 
    do_eval=True, 
    do_predict=False, 
    evaluation_strategy=IntervalStrategy.STEPS, 
    prediction_loss_only=False, 
    per_device_train_batch_size=20, 
    per_device_eval_batch_size=16, 
    gradient_accumulation_steps=1, 
    eval_accumulation_steps=None, 
    learning_rate=0.0001, 
    weight_decay=0.0, 
    adam_beta1=0.9, 
    adam_beta2=0.999, 
    adam_epsilon=1e-08, 
    max_grad_norm=1.0, 
    num_train_epochs=20.0, 
    max_steps=-1, 
    lr_scheduler_type=SchedulerType.LINEAR, 
    warmup_ratio=0.0, 
    warmup_steps=0, 
    logging_dir=runs/May12_05-06-46_a600ce861ff7, 
    logging_strategy=IntervalStrategy.STEPS, 
    logging_first_step=False, 
    logging_steps=1000, 
    save_strategy=IntervalStrategy.STEPS, 
    save_steps=1000, 
    save_total_limit=3, 
    no_cuda=False, 
    seed=42, 
    fp16=True, 
    fp16_opt_level=O1, 
    fp16_backend=auto, 
    fp16_full_eval=False, 
    local_rank=-1, 
    tpu_num_cores=None, 
    tpu_metrics_debug=False, 
    debug=[], 
    dataloader_drop_last=False, 
    eval_steps=1000, 
    dataloader_num_workers=2, 
    past_index=-1, 
    run_name=cv_sm_1, 
    disable_tqdm=False, 
    remove_unused_columns=True, 
    label_names=None, 
    load_best_model_at_end=True, 
    metric_for_best_model=loss, 
    greater_is_better=False, 
    ignore_data_skip=False, 
    sharded_ddp=[], 
    deepspeed=None, 
    label_smoothing_factor=0.0, 
    adafactor=False, 
    group_by_length=True, 
    length_column_name=length, 
    report_to=['wandb'], 
    ddp_find_unused_parameters=None, 
    dataloader_pin_memory=True, 
    skip_memory_metrics=False, 
    use_legacy_prediction_loop=False, 
    push_to_hub=False, 
    resume_from_checkpoint=None, 
    _n_gpu=1, 
    mp_parameters=
)
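The checkpoint-related part of this configuration corresponds to roughly the following (a sketch reconstructed from the dump above, not my original training script):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/share/datasets/output_run",
    evaluation_strategy="steps",
    eval_steps=1000,
    save_strategy="steps",         # save a checkpoint every `save_steps` steps
    save_steps=1000,
    save_total_limit=3,            # keep at most 3 checkpoints; older ones are deleted
    load_best_model_at_end=True,   # reload the best checkpoint when training ends
    metric_for_best_model="loss",
    greater_is_better=False,
)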

Thanks in advance!

Edit:
According to the Trainer output, the files are created:

Saving model checkpoint to /share/datasets/output_run/checkpoint-24000
Configuration saved in /share/datasets/output_run/checkpoint-24000/config.json
Model weights saved in /share/datasets/output_run/checkpoint-24000/pytorch_model.bin
Configuration saved in /share/datasets/output_run/checkpoint-24000/preprocessor_config.json

But I could not verify that with the file explorer or with ls /share/datasets/output_run/checkpoint-24000/.
I also searched the entire drive with !find / -name optimizer.pt, but files like optimizer.pt and pytorch_model.bin only turned up in the "valid" checkpoints I mentioned above (1k-7k), and therefore only for checkpoints < 7000.
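A quick way to check which checkpoint folders are complete (a sketch; the expected file names are the standard Trainer outputs and may vary with model and optimizer):

import os

EXPECTED = {"config.json", "pytorch_model.bin", "optimizer.pt",
            "scheduler.pt", "trainer_state.json", "rng_state.pth"}

root = "/share/datasets/output_run"
for name in sorted(os.listdir(root)):
    path = os.path.join(root, name)
    if name.startswith("checkpoint-") and os.path.isdir(path):
        missing = EXPECTED - set(os.listdir(path))
        print(name, "complete" if not missing else f"missing: {sorted(missing)}")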

Could you explain the entire use case a bit more, please? I'm unsure which class is supposed to store the checkpoints, and also whether the issue lies with Colab, HF, or a combination of both.
Are you able to manually store the desired files using “pure” PyTorch?
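For example, a minimal sketch along these lines (assuming model, optimizer, and lr_scheduler are in scope) would write roughly the same files a Trainer checkpoint contains:

import os
import torch

ckpt_dir = "/share/datasets/output_run/manual-checkpoint"
os.makedirs(ckpt_dir, exist_ok=True)

# Mirror the files a Trainer checkpoint normally holds.
torch.save(model.state_dict(), os.path.join(ckpt_dir, "pytorch_model.bin"))
torch.save(optimizer.state_dict(), os.path.join(ckpt_dir, "optimizer.pt"))
torch.save(lr_scheduler.state_dict(), os.path.join(ckpt_dir, "scheduler.pt"))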

There was a pull request for this problem which, as far as I can tell, solves it. Thanks for your time. PR