Epoch_length wrong number of iterations

I want to iterate over my training samples (172 samples) for a fixed number of iterations per epoch (e.g. 512):

trainer.run(train_loader, max_epochs=500, epoch_length=512)

and in my Dataset subclass implementation I’ve defined __len__ to return the number of samples (172).
But loading does not stop at 512 samples as epoch_length suggests; it stops at 688 samples (4 x 172). I noticed this by printing a counter every time __getitem__ is called.
Any thoughts?

Hi @xen0f0n, what is the expected behaviour for you?
I think what happens is normal behaviour:
Your data provider has 172 samples; when the engine is asked to measure an epoch as 512 iterations, that does not mean the data provider is restarted after 512 iterations.

data
|-----|-----|-----|-----|-----|-----|-----|-----|
epoch
|----------|----------|----------|----------|

So the trainer keeps asking the data provider for data; by the 512th iteration the data loader has already been restarted twice and provides the 168th sample, and on the 513th iteration (epoch + 1) the provided data corresponds to the 169th sample from the data loader.
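
To make that arithmetic concrete, here is a minimal sketch (an illustration, not library code) of how a 1-based iteration number maps onto a 0-based sample index when the provider has 172 samples:

n_samples = 172

def sample_index(iteration):
    # iteration is 1-based, the returned index is 0-based;
    # the data provider is simply restarted whenever it is exhausted
    return (iteration - 1) % n_samples

print(sample_index(512))  # 167 -> the 168th sample (provider already restarted twice)
print(sample_index(513))  # 168 -> the 169th sample, served during epoch 2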

What do you think ?

@vfdev-5 Well, it would make sense if on the 513th iteration I were on epoch + 1.
But I don’t think that’s the case. At the completion of each epoch I log some metrics, and this happens on iteration 688 and not before. Hence, that’s when the epoch ends, right?

OK, we need to check versions and the code. Here is my snippet

import ignite
print(ignite.__version__)

from ignite.engine import Engine

# the process function just prints "epoch - iteration : batch"
trainer = Engine(lambda e, b: print("{} - {} : {}".format(e.state.epoch, e.state.iteration, b)))

data = list(range(172))
trainer.run(data, epoch_length=512, max_epochs=10)

with the output

0.4.0.dev20200302
1 - 1 : 0
1 - 2 : 1
1 - 3 : 2
1 - 4 : 3
1 - 5 : 4
1 - 6 : 5
1 - 7 : 6
1 - 8 : 7
1 - 9 : 8
1 - 10 : 9
1 - 11 : 10
...
1 - 510 : 165
1 - 511 : 166
1 - 512 : 167
2 - 513 : 168
2 - 514 : 169
2 - 515 : 170
2 - 516 : 171
2 - 517 : 0
2 - 518 : 1
2 - 519 : 2
2 - 520 : 3
2 - 521 : 4
2 - 522 : 5
2 - 523 : 6
...
10 - 5112 : 123
10 - 5113 : 124
10 - 5114 : 125
10 - 5115 : 126
10 - 5116 : 127
10 - 5117 : 128
10 - 5118 : 129
10 - 5119 : 130
10 - 5120 : 131

@vfdev-5 I got the same output. But I still can’t explain why the evaluation attached to EPOCH_COMPLETED
@trainer.on(Events.EPOCH_COMPLETED)
takes place NOT on iteration 512 but on 688. I’ve also tried this with epoch_length=3, and engine.state.iteration is 172 when logging.


@trainer.on(Events.EPOCH_COMPLETED)
def evaluate(trainer):
    with evaluator.add_event_handler(Events.COMPLETED, log_metrics, "train"):
        evaluator.run(train_loader)

It would make more sense to me if both Events.ITERATION_COMPLETED(every=512) and Events.EPOCH_COMPLETED(every=1) fired at the “same time”.
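
As a quick sanity check, here is a minimal sketch built on the toy Engine from the snippet above (an illustration, not from the original code) suggesting that, on the trainer itself, both events do fire on the same iteration when epoch_length=512:

from ignite.engine import Engine, Events

trainer = Engine(lambda e, b: None)

@trainer.on(Events.ITERATION_COMPLETED(every=512))
def on_every_512_iterations(engine):
    print("ITERATION_COMPLETED(every=512) at iteration", engine.state.iteration)

@trainer.on(Events.EPOCH_COMPLETED)
def on_epoch_completed(engine):
    print("EPOCH_COMPLETED at iteration", engine.state.iteration)

data = list(range(172))
trainer.run(data, epoch_length=512, max_epochs=2)
# expected: both handlers report iteration 512, then iteration 1024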

@vfdev-5 It seems I’ve made a mistake… The trainer accepts epoch_length as an argument, BUT I’ve created an evaluator engine with create_supervised_evaluator, and although trainer.state is correct, evaluator.state doesn’t have the same epoch and iteration.

So, could you find where the problem is in your code?

although trainer.state is correct, evaluator.state doesn’t have the same epoch and iteration.

evaluator.state normally contains nothing interesting except metrics. As we usually run it
like evaluator.run(train_loader), it does 1 epoch defined by len(train_loader), and according to your example its state should contain the following:

iteration: 172
epoch: 1
epoch_length: 172
max_epochs: 1

And that is correct: the evaluator did 1 epoch and 172 iterations as asked. It does not contain any training information.
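
A minimal sketch of what that looks like (a plain list stands in for train_loader; hypothetical names):

from ignite.engine import Engine

# stand-in evaluator whose process function does nothing
evaluator = Engine(lambda engine, batch: None)

fake_train_loader = list(range(172))  # stands in for a loader of length 172
evaluator.run(fake_train_loader)      # defaults: max_epochs=1, epoch_length=len(data)

print(evaluator.state.epoch)         # 1
print(evaluator.state.iteration)     # 172
print(evaluator.state.epoch_length)  # 172
print(evaluator.state.max_epochs)    # 1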

@vfdev-5 I am a bit confused. I use the same evaluator for both train_loader and val_loader. Are the metrics calculated during training, or does the evaluator iterate over the set again after the training epoch (with other random transforms), just without updating the weights?

On the train set, the metrics should take into account the samples of the whole epoch (512 samples) and not just len(train_loader). With batch_size=1, I should get 512 CE losses and compute their mean, and IoU should be averaged over 512 samples (not 172).

I use the same evaluator for both train_loader and val_loader. Are the metrics calculated during training, or does the evaluator iterate over the set again after the training epoch (with other random transforms), just without updating the weights?

The evaluator does not compute metrics during training. Yes, the evaluator iterates over the set again and the metrics are computed with a “fixed” model (without updating the weights).
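
Roughly speaking (a paraphrase of the idea, not the exact library code, and device handling is omitted), the update step of an engine created with create_supervised_evaluator behaves like this:

import torch

def evaluation_step(engine, batch):
    model.eval()              # model stays fixed for the whole evaluation run
    with torch.no_grad():     # no gradients, so no weights are updated
        x, y = batch
        y_pred = model(x)
        return y_pred, y      # the attached metrics consume this (y_pred, y) pair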

On the train set, the metrics should take into account the samples of the whole epoch (512 samples)

In this case you just need to set epoch_length=512 in the call: evaluator.run(train_loader, epoch_length=512).
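
Applied to the handler from earlier in the thread, that would look something like this (a sketch reusing evaluator, train_loader and log_metrics from your code):

@trainer.on(Events.EPOCH_COMPLETED)
def evaluate(trainer):
    with evaluator.add_event_handler(Events.COMPLETED, log_metrics, "train"):
        # iterate 512 samples on the train set instead of len(train_loader) = 172
        evaluator.run(train_loader, epoch_length=512)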
