Scheduler state_dict size


I am currently debugging an error that crashes my program when resuming a training run. Restoring the scheduler causes an out-of-memory error, and I might create a separate thread about that problem.

Currently I am investigating my scheduler, and I am wondering why the saved scheduler.state_dict() is so big.

I use an SGD optimizer and the OneCycleLR scheduler.

optimizer = torch.optim.SGD(objective_params, lr=lr, momentum=mo, weight_decay=wd)
scheduler = OneCycleLR(optimizer, max_lr=lr, total_steps=conf.max_iter)

Now I immediately save the state_dicts of the model, the optimizer and the scheduler:

torch.save(model.state_dict(), modelpath)
torch.save(scheduler.state_dict(), schedpath)
torch.save(optimizer.state_dict(), optpath)

The files have the following sizes:

model: 852.7 MiB
optimizer: 1.9 KiB
scheduler: 851.9 MiB

Then I start the training, and because I use mixed precision it takes a few iterations until the optimizer actually performs its step() for the first time.

After the optimizer has stepped, the saved state dicts have the following sizes:

model: 852.7 MiB
optimizer: 843.9 MiB
scheduler: 1.7 GiB

Why is the size of the scheduler so enormous? It makes sense that SGD uses about the same amount of memory as the model, because every weight has a momentum buffer, but I do not understand why the scheduler takes up so much memory. Printing the scheduler state_dict does not answer my question, as it is basically empty:

{'total_steps': 200000, 'step_size_up': 59999.0, 'step_size_down': 140000.0, 'anneal_func': <bound method OneCycleLR._annealing_cos of <torch.optim.lr_scheduler.OneCycleLR object at 0x7fd8a1061390>>, 'cycle_momentum': True, 'use_beta1': False, 'base_lrs': [4e-05], 'last_epoch': 10, '_step_count': 10}
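One way to locate the heavy entry is to pickle each state_dict value separately and compare sizes. A minimal sketch (entry_sizes is a hypothetical helper, and the demo dict is a toy stand-in; with a real scheduler you would pass scheduler.state_dict()):

```python
import pickle

def entry_sizes(state_dict):
    """Return the pickled size in bytes of each state_dict entry
    (None if the entry is not picklable on its own)."""
    sizes = {}
    for key, value in state_dict.items():
        try:
            sizes[key] = len(pickle.dumps(value))
        except Exception:
            sizes[key] = None
    return sizes

# Toy stand-in for a scheduler state_dict; the "blob" entry plays
# the role of a single oversized value hiding among small ones.
demo = {"total_steps": 200000, "base_lrs": [4e-05], "blob": bytes(10_000)}
for key, size in entry_sizes(demo).items():
    print(key, size)
```

Run on the real state_dict, this immediately singles out which key accounts for the hundreds of megabytes.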

Greetings Rupert


Two years late… but I came across the same problem recently. The root cause is the anneal_func entry in the scheduler state_dict. Since scheduler.anneal_func is a bound method (OneCycleLR._annealing_cos bound to the scheduler instance), pickling it serializes the entire scheduler object, including the scheduler.optimizer attribute, which is normally excluded from scheduler.state_dict(). The problem is that scheduler.optimizer.param_groups references all the model parameters. That is why scheduler.state_dict takes up so much space.
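The mechanism is plain Python: pickle serializes a bound method as its owning instance plus the method name, so everything the instance references comes along. A minimal sketch of both the problem and the workaround of dropping the key before saving (the Annealer class is a stand-in I made up to mimic OneCycleLR, not PyTorch code):

```python
import pickle

class Annealer:
    """Stand-in for OneCycleLR: a big attribute plus a bound method."""
    def __init__(self):
        # Plays the role of scheduler.optimizer (references all parameters).
        self.big_payload = list(range(100_000))
        # Bound method: keeps a reference to the whole instance.
        self.anneal_func = self._annealing_cos

    def _annealing_cos(self, pct):
        return pct

    def state_dict(self):
        # Mimics the old OneCycleLR behaviour: the bound method leaks in.
        return {"anneal_func": self.anneal_func, "last_epoch": 0}

sched = Annealer()
with_method = len(pickle.dumps(sched.state_dict()))

# Workaround sketch: drop the bound method before serializing, so the
# instance (and its big payload) is not dragged into the pickle.
clean = {k: v for k, v in sched.state_dict().items() if k != "anneal_func"}
without_method = len(pickle.dumps(clean))
print(with_method, without_method)
```

The first pickle is orders of magnitude larger than the second, because it contains the whole instance. On affected PyTorch versions the analogous workaround is to remove the 'anneal_func' key before torch.save; a freshly constructed scheduler gets its own anneal_func in __init__, so load_state_dict does not need it (see the linked issue for details).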

For more info, please refer to: serializes a bound method in OneCycleLR's state_dict, causing CUDA problems · Issue #42376 · pytorch/pytorch · GitHub.