Optimizer memory usage

Hi there,
I was training a model with SGD and decided to move to Adam.

I was using a batch size of 20 with SGD; however, the maximum batch size I can use with Adam is 2.

    optimizer = torch.optim.SGD([{'params': model.unet_model.parameters()},
                                 {'params': model.audio_s.parameters()},
                                 {'params': model.drn_model.parameters(), 'lr': args.DRNlr}],
                                lr=LR,
                                weight_decay=WEIGTH_DECAY)

All I did was change this line:

    optimizer = torch.optim.Adam([{'params': model.unet_model.parameters()},
                                  {'params': model.audio_s.parameters()},
                                  {'params': model.drn_model.parameters(), 'lr': args.DRNlr}],
                                 lr=LR,
                                 weight_decay=WEIGTH_DECAY)

Is there any memory usage comparison among the optimizers? Or is that memory usage normal?

Increased memory usage is normal for optimizers that keep running estimates.
Since the memory usage depends on the number of parameters, I’m not sure anyone has compared it directly.
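As a rough illustration (my own sketch, not from the reply above, and assuming a CUDA device is available): Adam keeps two running estimates per parameter, exp_avg and exp_avg_sq, which are only allocated after the first step, so you can count them directly:

    import torch

    model = torch.nn.Linear(1000, 1000).cuda()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # The optimizer state is empty until the first step.
    model(torch.randn(8, 1000, device='cuda')).sum().backward()
    optimizer.step()

    state_numel = sum(t.numel() for s in optimizer.state.values()
                      for t in s.values() if torch.is_tensor(t))
    param_numel = sum(p.numel() for p in model.parameters())
    print(state_numel / param_numel)  # ~2 for Adam: exp_avg + exp_avg_sq

Plain SGD without momentum keeps no such state, which is roughly why the same model fits a much larger batch size with SGD.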

@ptrblck Let me ask an additional question.

How can I choose where those estimates are stored?
I guess they are stored on GPU 0 by default, but even when using several GPUs, the GPU that stores those estimates runs out of memory. I would like to use one GPU for computing them and the rest for handling the model with DataParallel.

Is that possible in PyTorch?

As far as I know, the optimizer will store its internal state on the same GPU the model was transferred to.
Usually that’s cuda:0. Since my multi-GPU machine is currently busy, I can’t test it right now.
You could check the device placement with print(optimizer.param_groups) and see where the tensors are stored.
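A small sketch of what such a check could look like (my own addition; optimizer.state is only populated after the first step):

    # Devices of the parameters held by the optimizer.
    for group in optimizer.param_groups:
        for p in group['params']:
            print(p.device)

    # Devices of the running estimates.
    for param, state in optimizer.state.items():
        for name, t in state.items():
            if torch.is_tensor(t):
                print(name, t.device)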

@ptrblck
Those parameters are stored on GPU 0.
When you talk about transferring the model in the data parallel case, the main GPU for

    model = torch.nn.DataParallel(model).cuda()

would be cuda:0 in this case, right?

Well, I’m using 3 GPUs, so I’m trying to do

    model = torch.nn.DataParallel(model, device_ids=[1, 2]).cuda()

to force PyTorch to handle the model on cuda:1 and cuda:2, but PyTorch does not allow that; it requires all tensors to be on devices[0]:

    Traceback (most recent call last):
      File "train.py", line 300, in <module>
        main()
      File "train.py", line 260, in main
        output = model(video,audio)
      File "/home/jfmontesinos/.local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/jfmontesinos/.local/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 113, in forward
        replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
      File "/home/jfmontesinos/.local/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 118, in replicate
        return replicate(module, device_ids)
      File "/home/jfmontesinos/.local/lib/python2.7/site-packages/torch/nn/parallel/replicate.py", line 12, in replicate
        param_copies = Broadcast.apply(devices, *params)
      File "/home/jfmontesinos/.local/lib/python2.7/site-packages/torch/nn/parallel/_functions.py", line 17, in forward
        outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
      File "/home/jfmontesinos/.local/lib/python2.7/site-packages/torch/cuda/comm.py", line 40, in broadcast_coalesced
        return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
    RuntimeError: all tensors must be on devices[0]

Isn’t it possible either to store the optimizer’s state on an arbitrary GPU, or to reduce the workload on cuda:0 so it doesn’t run out of memory?
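For what it’s worth, a sketch of my own (not from the thread): DataParallel expects the module’s parameters to live on device_ids[0], so with device_ids=[1, 2] the model has to be moved to cuda:1 rather than to the default cuda:0. Since the optimizer state is created on the same device as the parameters, this would also keep the running estimates off cuda:0:

    # Assumption: using only cuda:1 and cuda:2, keeping cuda:0 free.
    device = torch.device('cuda:1')  # must match device_ids[0]
    model = torch.nn.DataParallel(model, device_ids=[1, 2]).to(device)

    # Create the optimizer after moving the model, so its state
    # is allocated on cuda:1 as well.
    optimizer = torch.optim.Adam(model.parameters(), lr=LR)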

Based on my understanding of the Adam optimizer, it has to store two extra values per parameter (the running averages of the gradient and of the squared gradient), so for every fp32 parameter you store roughly two extra fp32 values, thus approximately tripling the memory needed for the parameters plus optimizer state.
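A rough way to see this (my own sketch, assuming a single CUDA device is available) is to compare how much extra memory one optimizer step allocates for SGD versus Adam:

    import torch

    def step_state_bytes(optimizer_cls):
        model = torch.nn.Linear(4096, 4096).cuda()
        optimizer = optimizer_cls(model.parameters(), lr=1e-3)
        model(torch.randn(8, 4096, device='cuda')).sum().backward()
        before = torch.cuda.memory_allocated()
        optimizer.step()  # Adam allocates exp_avg and exp_avg_sq here
        return torch.cuda.memory_allocated() - before

    print('SGD: ', step_state_bytes(torch.optim.SGD))   # ~0 (no momentum buffers)
    print('Adam:', step_state_bytes(torch.optim.Adam))  # ~2x the parameter size

Note that the gradients add yet another parameter-sized copy on top of that, before counting activations.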