Model not saving

AfonsoSalgadoSousa · February 25, 2022, 11:16am

Hi. In a standard train script, I have the following partial function:

best_model_handler = partial(
    Checkpoint,
    {"model": model},
    DiskSaver(dirname=cfg.work_dir.as_posix(), require_empty=False),
    filename_prefix="best",
    n_saved=2,
    global_step_transform=global_step_from_engine(trainer),
)

If I want to evaluate, I do this:

best_model_handler = best_model_handler(
    score_name="val_bleu",
    score_function=Checkpoint.get_default_score_fn("bleu"),
)
evaluator.add_event_handler(Events.COMPLETED, best_model_handler)

This works fine. On the other hand, when I skip validation and attach the handler to the trainer instead, I do the following:

best_model_handler = best_model_handler(
    score_name="train_loss",
    score_function=Checkpoint.get_default_score_fn("loss"),
)
trainer.add_event_handler(Events.COMPLETED, best_model_handler)

But it throws the following error:

File "/home/admin/anaconda3/envs/catbird/lib/python3.8/site-packages/ignite/handlers/checkpoint.py", line 648, in wrapper
    return score_sign * engine.state.metrics[metric_name]
KeyError: 'loss'

However, if I do not define score_name or score_function, the script runs fine, but the model is not saved. How can I fix this? Thanks in advance for any help you can provide.

vfdev-5 · February 26, 2022, 8:35am

@AfonsoSalgadoSousa thanks for reporting. I'll check what happens in detail and write back here.

A few remarks: score_function=Checkpoint.get_default_score_fn("loss") fetches the metric from the state of the engine the handler is attached to, and by default there is no "loss" key in trainer.state.metrics. That is why the KeyError is raised. See the Checkpoint page of the PyTorch-Ignite v0.4.8 documentation.

By the way, since v0.4.8 you can pass a str (the target directory) as save_handler instead of constructing a DiskSaver.

"However, if I do not define score_name or score_function, the script runs fine, but the model is not saved. How can I fix this?"

This is very strange. I can try to reproduce the issue, but it would be very helpful if you could provide a minimal code example. Thanks!

AfonsoSalgadoSousa · February 28, 2022, 10:22am

Thanks for the reply. The following code runs successfully but the model is not saved to the newly-created directory.

from functools import partial

import ignite.distributed as idist
import torch
from ignite.engine import Engine, Events
from ignite.handlers import Checkpoint, DiskSaver, global_step_from_engine
from torch import nn, optim
from torch.nn import CrossEntropyLoss
from torch.utils.data import DataLoader, TensorDataset


def train_step(engine, batch):
    X, y = batch[0].to(idist.device()), batch[1].to(idist.device())
    model.train()
    loss_fct = CrossEntropyLoss()
    y_pred = model(X)
    loss = loss_fct(y_pred, y)
    print(loss)
    return loss.item()


x = torch.randn(8, 2)
y = torch.empty(8, dtype=torch.long).random_(4)
my_dataset = TensorDataset(x, y)
my_dataloader = DataLoader(my_dataset)
model = nn.Sequential(nn.Linear(2, 10), nn.ReLU(), nn.Linear(10, 4)).to(idist.device())
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

trainer = Engine(train_step)

best_model_handler = partial(
    Checkpoint,
    {"model": model},
    DiskSaver(dirname="./testing", require_empty=False),
    filename_prefix="best",
    n_saved=2,
    global_step_transform=global_step_from_engine(trainer),
)
trainer.add_event_handler(Events.COMPLETED, best_model_handler)

trainer.run(my_dataloader, max_epochs=1)

vfdev-5 · February 28, 2022, 11:11am

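@AfonsoSalgadoSousa I think there is a bug in this code snippet: best_model_handler is a functools.partial object, not a Checkpoint instance, so the object attached to the trainer is not the handler itself.
If you change the line to trainer.add_event_handler(Events.COMPLETED, best_model_handler()) and rerun the code, the checkpoint is written: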

ls testing/

best_model_1.pt
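To illustrate the difference between the partial and the object it builds, here is a small standard-library sketch. FakeCheckpoint is a hypothetical stand-in for Checkpoint (it only records file names instead of writing files), and the dict passed as the engine is likewise an assumption made to keep the example free of dependencies.

```python
from functools import partial


class FakeCheckpoint:
    """Hypothetical stand-in for Checkpoint: configured once, then called per event."""

    def __init__(self, to_save, filename_prefix=""):
        self.to_save = to_save
        self.filename_prefix = filename_prefix
        self.saved = []

    def __call__(self, engine):
        # A real Checkpoint would write a file here; this one only records a name.
        self.saved.append(f"{self.filename_prefix}_model_{engine['epoch']}.pt")


# The partial is a factory: calling it builds a handler, it does not save anything.
factory = partial(FakeCheckpoint, {"model": object()}, filename_prefix="best")
print(isinstance(factory, FakeCheckpoint))  # False

# Calling the factory yields the actual handler, which is what should be attached.
handler = factory()
print(isinstance(handler, FakeCheckpoint))  # True

# Only calling the handler itself (as an engine would on an event) "saves".
handler({"epoch": 1})
print(handler.saved)  # ['best_model_1.pt']
```

This is consistent with what the thread reports: with the partial attached, the run finishes without errors but no file appears, because at most a new handler object gets constructed and thrown away.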

AfonsoSalgadoSousa · February 28, 2022, 11:32am

I am sorry for the inconvenience. That was it—another bug.

AfonsoSalgadoSousa · February 28, 2022, 11:53am

BTW, @vfdev-5, what is the default behaviour when neither score_name nor score_function is set?

vfdev-5 · February 28, 2022, 12:07pm

Checkpoint should just save the contents of to_save to a file each time the event it is attached to fires.
For example:

    handler = Checkpoint(
        to_save, '/tmp/models', n_saved=2
    )
    trainer.add_event_handler(Events.ITERATION_COMPLETED(every=1000), handler)

If the trainer runs 3000 iterations, the handler is called 3 times. It saves 3 files in total and removes the first one (as n_saved=2), so only the two most recent remain.