Scheduler state_dict size


I am currently debugging an error that crashes my program when resuming a training run. Restoring the scheduler causes an out-of-memory error, and I might create a separate thread about that problem.

Currently I am investigating my scheduler, and I am wondering why the saved scheduler.state_dict() is so big.

I use an SGD optimizer and the OneCycleLR scheduler.

optimizer = torch.optim.SGD(objective_params, lr=lr, momentum=mo, weight_decay=wd)
scheduler = OneCycleLR(optimizer, max_lr=lr, total_steps=conf.max_iter)

Now I immediately save the state_dicts of the model, the optimizer and the scheduler:

torch.save(model.state_dict(), modelpath)
torch.save(scheduler.state_dict(), schedpath)
torch.save(optimizer.state_dict(), optpath)

The files have the following sizes:

model: 852.7 MiB
optimizer: 1.9 KiB
scheduler: 851.9 MiB

Then I start the training, and because I use mixed precision it takes a few iterations until the optimizer actually performs its step() for the first time.

After the optimizer has stepped, the saved state dicts have the following sizes:

model: 852.7 MiB
optimizer: 843.9 MiB
scheduler: 1.7 GiB

Why is the size of the scheduler so enormous? It makes sense that SGD uses about the same amount of memory as the model, because every weight has a momentum buffer, but I do not understand why the scheduler takes up so much memory. Printing the scheduler state_dict does not answer my question, as it is basically empty:

{'total_steps': 200000, 'step_size_up': 59999.0, 'step_size_down': 140000.0, 'anneal_func': <bound method OneCycleLR._annealing_cos of <torch.optim.lr_scheduler.OneCycleLR object at 0x7fd8a1061390>>, 'cycle_momentum': True, 'use_beta1': False, 'base_lrs': [4e-05], 'last_epoch': 10, '_step_count': 10}
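One way to locate the heavy entry is to pickle each state_dict value separately and compare sizes. A minimal sketch (entry_sizes is a hypothetical helper, and the demo dict is a toy stand-in; with a real scheduler you would pass scheduler.state_dict()):

```python
import pickle

def entry_sizes(state_dict):
    """Return the pickled size in bytes of each state_dict entry
    (None if the entry is not picklable on its own)."""
    sizes = {}
    for key, value in state_dict.items():
        try:
            sizes[key] = len(pickle.dumps(value))
        except Exception:
            sizes[key] = None
    return sizes

# Toy stand-in for a scheduler state_dict; the "blob" entry plays
# the role of a single oversized value hiding among small ones.
demo = {"total_steps": 200000, "base_lrs": [4e-05], "blob": bytes(10_000)}
for key, size in entry_sizes(demo).items():
    print(key, size)
```

Run on the real state_dict, this immediately singles out which key accounts for the hundreds of megabytes.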

Greetings Rupert


Two years late… but I came across the same problem recently. The root cause is the anneal_func entry in the scheduler state_dict. Since scheduler.anneal_func is a bound method (OneCycleLR._annealing_cos bound to the scheduler instance), pickling it serializes the entire scheduler object, including the scheduler.optimizer attribute, which is normally excluded from scheduler.state_dict(). The problem is that scheduler.optimizer.param_groups references all the model parameters. That is why scheduler.state_dict takes up so much space.
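The mechanism is plain Python: pickle serializes a bound method as its owning instance plus the method name, so everything the instance references comes along. A minimal sketch of both the problem and the workaround of dropping the key before saving (the Annealer class is a stand-in I made up to mimic OneCycleLR, not PyTorch code):

```python
import pickle

class Annealer:
    """Stand-in for OneCycleLR: a big attribute plus a bound method."""
    def __init__(self):
        # Plays the role of scheduler.optimizer (references all parameters).
        self.big_payload = list(range(100_000))
        # Bound method: keeps a reference to the whole instance.
        self.anneal_func = self._annealing_cos

    def _annealing_cos(self, pct):
        return pct

    def state_dict(self):
        # Mimics the old OneCycleLR behaviour: the bound method leaks in.
        return {"anneal_func": self.anneal_func, "last_epoch": 0}

sched = Annealer()
with_method = len(pickle.dumps(sched.state_dict()))

# Workaround sketch: drop the bound method before serializing, so the
# instance (and its big payload) is not dragged into the pickle.
clean = {k: v for k, v in sched.state_dict().items() if k != "anneal_func"}
without_method = len(pickle.dumps(clean))
print(with_method, without_method)
```

The first pickle is orders of magnitude larger than the second, because it contains the whole instance. On affected PyTorch versions the analogous workaround is to remove the 'anneal_func' key before torch.save; a freshly constructed scheduler gets its own anneal_func in __init__, so load_state_dict does not need it (see the linked issue for details).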

For more info, please refer to: serializes a bound method in OneCycleLR's state_dict, causing CUDA problems · Issue #42376 · pytorch/pytorch · GitHub.