Average time per epoch calculation in PyTorch Ignite

Hello,

I need to save a file with the average elapsed time per epoch, and I found this function in the documentation:

https://pytorch.org/ignite/_modules/ignite/handlers/timing.html

t = Timer(average=True)
for _ in range(10):
    work()
    idle()
    t.step()

def save_time(engine):
    torch.save(t.value(), 'time.pt')

trainer.add_event_handler(Events.EPOCH_STARTED(every=3), save_time)

This works, but it shows the elapsed time over 3 epochs, and I suppose the file gets overwritten every 3 epochs. What I want is to compute the average time just once, for the first epoch, and then stop the timer. I don't want to divide the timer value by 3, and I also don't want to generate one file per epoch. One option is to run my code with just one epoch, but I am running several models and I would like to generate this just once every time I train.

How can I achieve this? :smiley:

PS: the documentation shows an example doing this per iteration, but I am unable to translate that example to epochs. I believe the documentation should include an epoch example as well.

Regards,

@Tanya_Boone Thank you for this question.

I think that several things could help you.

First, you can use Timer with a filtered event. In your case, I think you should try the following code to reset the timer every 3 epochs (on EPOCH_STARTED):

timer.attach(trainer,
             start=Events.EPOCH_STARTED(every=3),
             resume=Events.ITERATION_STARTED,
             pause=Events.ITERATION_COMPLETED)
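For intuition, here is a plain-Python sketch (no Ignite required) of what that attach call does: the stopwatch accumulates time only while iterations run, and is reset on the epochs where the `every=3` filter fires (epochs 3, 6, …). The `Stopwatch` class and the fake workload are illustrative stand-ins, not Ignite's actual implementation:

```python
import time

class Stopwatch:
    """Minimal stand-in for ignite.handlers.Timer (illustrative only)."""
    def __init__(self):
        self.reset()

    def reset(self):           # start=Events.EPOCH_STARTED(every=3)
        self.total = 0.0
        self._t0 = None

    def resume(self):          # resume=Events.ITERATION_STARTED
        self._t0 = time.perf_counter()

    def pause(self):           # pause=Events.ITERATION_COMPLETED
        self.total += time.perf_counter() - self._t0
        self._t0 = None

    def value(self):
        return self.total

sw = Stopwatch()
for epoch in range(1, 7):
    if epoch % 3 == 0:         # approximates Ignite's every=3 filter (epochs 3, 6)
        sw.reset()
    for _ in range(5):         # 5 iterations per epoch
        sw.resume()
        time.sleep(0.001)      # fake work()
        sw.pause()
print(f"time accumulated since the last reset: {sw.value():.3f}s")
```

After 6 epochs, the stopwatch only holds the time of the epochs since the last reset, which is the effect the attach call above achieves with events.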

Another option is to use Timer directly (without step(), so the value is not averaged):

timer = Timer()

@trainer.on(Events.STARTED)
def time_start():
    # reset the timer at the beginning
    timer.reset()

@trainer.on(Events.EPOCH_STARTED(every=3))
def print_time():
    # use value() to get the elapsed time
    print("timer of 3 epochs :", timer.value())
    # reset the timer for the next 3 epochs
    timer.reset()
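As background on the original snippet: with average=True, Timer divides the total elapsed time by the number of step() calls, so value() already reports the mean duration per step. A plain-Python equivalent of that arithmetic (the loop body is a fake workload, not Ignite code):

```python
import time

steps = 0
t0 = time.perf_counter()
for _ in range(10):            # 10 "epochs"
    time.sleep(0.001)          # fake work() + idle()
    steps += 1                 # each t.step() counts one interval
total = time.perf_counter() - t0
average = total / steps        # what Timer(average=True).value() returns
print(f"average time per epoch: {average:.4f}s")
```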

One last point I would like to mention: we recently added epoch timers in trainer.state.timers. That is another option which should help you measure epoch times:

@trainer.on(Events.EPOCH_COMPLETED)
def print_timers():
    # trainer.state.timers is a dict
    print("timers=", trainer.state.timers)

If you need more support, don’t hesitate to ask, it would be a pleasure to help :slight_smile:

HTH

Thanks Sylvain !

@Tanya_Boone please let us know if Sylvain’s answer solves your problem. Thanks !

Hello,

I am trying with this method

timer.attach(trainer,
             start=Events.EPOCH_STARTED(every=1),
             resume=Events.ITERATION_STARTED,
             pause=Events.ITERATION_COMPLETED)

Afterwards I want to print the value:

print("timers=", trainer.state.timers)

But I get: Engine run is terminating due to exception: 'State' object has no attribute 'timers'. I changed it to timer but I am getting the same error :frowning:

This feature is not in the stable 0.3.0 release. You have to use the git version or the nightly version. Sorry for missing that point.

To install nightly release: https://github.com/pytorch/ignite#nightly-releases


Hi all, I am sorry to bump an old conversation, but my question is related to the OP's.

I’m trying to measure the average wall-clock time per training epoch with Ignite’s Timer. I then want to log it to wandb when the training is completed or terminated (e.g. due to a Ctrl+C exception).

Right now I am using this code:

timer = Timer(average=True)
timer.attach(
    train_engine,
    start=Events.STARTED,
    resume=Events.EPOCH_STARTED,
    pause=Events.EPOCH_COMPLETED,
    step=Events.EPOCH_COMPLETED,
)

def log_timer(engine: Engine):
    print(f"Average time per epoch: {timer.value():.2f}")
    wandb.log({"avg_epoch_time": timer.value()})
    timer.reset()

train_engine.add_event_handler(Events.COMPLETED | Events.TERMINATE | Events.EXCEPTION_RAISED, log_timer)

However, the logged time-per-epoch seems unreasonably low (like 0.01, while I was expecting something like 2-4 seconds). Am I doing something wrong in the definition of the Timer?

@gnufabio it's hard to see what can go wrong without more code.
It may depend on whether you do some work on the EPOCH_COMPLETED or EPOCH_STARTED events, and whether you attached the timer before attaching those time-consuming handlers.
Or maybe, since you average the times, the result is simply lower than you expected?
Alternatively, you can try to use engine.state.times[Events.EPOCH_COMPLETED.name], State — PyTorch-Ignite v0.5.1 Documentation for that.

@vfdev-5 thanks for your answer.

I think I found the problem: it was due to log_timer being called for both Events.TERMINATE and Events.COMPLETED. timer.value() was low because I was resetting the timer after Events.TERMINATE fired. I solved it by removing the log_timer handler after its first call.
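The fix described here can be sketched with a toy callback registry (illustrative only, not Ignite's API): the handler detaches itself from all of its events on its first call, so a terminated run that also fires the completed event does not log and reset the timer twice.

```python
# Toy event bus to illustrate the fire-once pattern (not Ignite's API).
class Bus:
    def __init__(self):
        self.handlers = {}            # event name -> list of callbacks

    def on(self, event, fn):
        self.handlers.setdefault(event, []).append(fn)

    def remove(self, event, fn):
        self.handlers[event].remove(fn)

    def fire(self, event):
        for fn in list(self.handlers.get(event, [])):
            fn(event)

bus = Bus()
calls = []

def log_timer(event):
    calls.append(event)
    # detach from every event after the first call, so a terminated run
    # that also fires COMPLETED does not log (and reset the timer) twice
    for ev in ("COMPLETED", "TERMINATE"):
        bus.remove(ev, log_timer)

for ev in ("COMPLETED", "TERMINATE"):
    bus.on(ev, log_timer)

bus.fire("TERMINATE")   # first event: handler runs and detaches itself
bus.fire("COMPLETED")   # second event: nothing registered any more
print(calls)            # ['TERMINATE']
```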

That being said, I am not sure why Events.TERMINATE and Events.COMPLETED are both fired, since in the official documentation they seem to be mutually exclusive.
Just as a reference, I have a bunch of other handlers attached to the engine:

  • Handler to run the validation loop at the end of each epoch
  • ModelCheckpoint used to save the checkpoint and restore the weights when the trainer fires the TERMINATE or COMPLETED event
  • EarlyStopping fired on the evaluator’s Events.COMPLETED
  • Handler to catch Ctrl+C and gracefully terminate the training

Glad you could find the problem and hope it works now as expected. Feel free to ask other questions here or on our discord.

That being said, I am not sure why Events.TERMINATE and Events.COMPLETED are both fired, since in the official documentation they seem to be mutually exclusive.

We provided some info about that here: Engine — PyTorch-Ignite v0.5.1 Documentation

It was a question whether we should skip Events.COMPLETED when terminate() is called, and it was decided to keep it. Maybe it could be a feature to add a flag, terminate(skip_complete_event=True), to skip Events.COMPLETED.


I think that would make things more intuitive and flexible. While the Engine docs make it clear that trainer.terminate() fires both Events.COMPLETED and Events.TERMINATE, the Events docs contain a table that looks misleading…

Or perhaps I am misinterpreting it? Anyways, thank you for your kind support!

You are right, the table needs to be updated to make it clear.

PS: if you wish to make this contribution to the project, you are very welcome!
