I have a model for which gradient computations work fine, but then I get an OOM when running optimizer.step() with Adam. I understand that the Adam optimizer state needs much more memory because it stores first and second moment estimates in float32. But I fail to understand why that is really an issue.
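To put rough numbers on that, here is a back-of-the-envelope sketch. The helper `adam_training_bytes` and the 1-billion-parameter model with 2-byte weights and gradients are assumptions for illustration only, and activations are ignored. Note also that PyTorch's Adam allocates its state lazily on the first `step()`, which would be consistent with backward succeeding and the first optimizer step running out of memory.

```python
def adam_training_bytes(num_params, param_bytes=4, grad_bytes=4, state_bytes=4):
    """Rough memory for weights, gradients, and Adam state (activations excluded)."""
    weights = num_params * param_bytes
    grads = num_params * grad_bytes
    # Adam keeps two extra tensors per parameter: first and second moment estimates.
    adam_state = 2 * num_params * state_bytes
    return {
        "weights": weights,
        "grads": grads,
        "adam_state": adam_state,
        "total": weights + grads + adam_state,
    }


GIB = 1024 ** 3
# Hypothetical model: 1e9 parameters, 2-byte weights/grads, float32 Adam state.
est = adam_training_bytes(1_000_000_000, param_bytes=2, grad_bytes=2, state_bytes=4)
for name, nbytes in est.items():
    print(f"{name:>10}: {nbytes / GIB:.2f} GiB")
```

Under these assumptions the Adam state alone is twice as large as weights and gradients combined, so it is the natural candidate for offloading.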
Would it not be very simple to offload this optimizer state to CPU and then do the update computations in blocks of smaller size? The Adam updates are fully separable between parameters, and CPU offloading and doing it in blocks should be highly subdominant to computing gradients.
Am I missing something? Is this implemented somewhere?
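The separability argument can be checked with a small plain-Python sketch. It is not PyTorch code: `adam_update` is a textbook Adam step on lists (no weight decay), and `adam_update_in_blocks` is a hypothetical helper that applies the same step to a few elements at a time, which is where an offloading scheme would move one block of state to the GPU and back.

```python
import math


def adam_update(params, grads, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """In-place textbook Adam update on plain Python lists."""
    bc1 = 1 - beta1 ** step  # bias correction for the first moment
    bc2 = 1 - beta2 ** step  # bias correction for the second moment
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g * g
        params[i] -= lr * (m[i] / bc1) / (math.sqrt(v[i] / bc2) + eps)


def adam_update_in_blocks(params, grads, m, v, step, block_size, **kwargs):
    """Same update, but only `block_size` elements are touched at a time."""
    for start in range(0, len(params), block_size):
        end = start + block_size
        # Slices are copies, standing in for "load this block onto the device".
        p_blk, g_blk = params[start:end], grads[start:end]
        m_blk, v_blk = m[start:end], v[start:end]
        adam_update(p_blk, g_blk, m_blk, v_blk, step, **kwargs)
        # Write the updated block back to the full (offloaded) buffers.
        params[start:end], m[start:end], v[start:end] = p_blk, m_blk, v_blk


n = 10
grads = [0.1 * (i - 4.5) for i in range(n)]  # made-up gradient values

full, m_full, v_full = [1.0] * n, [0.0] * n, [0.0] * n
blk, m_blk_all, v_blk_all = [1.0] * n, [0.0] * n, [0.0] * n

adam_update(full, grads, m_full, v_full, step=1)
adam_update_in_blocks(blk, grads, m_blk_all, v_blk_all, step=1, block_size=3)

print(full == blk, m_full == m_blk_all, v_full == v_blk_all)
```

Because each element's update depends only on that element's own gradient and state, the block size affects peak memory but not the result.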
Even simpler would be to keep the model and optimizer on the CPU and do the Adam updates there. This would need the model to be copied to each GPU device and the gradients copied back to the CPU, but this is the same effort as DDP needs per round anyway. And while CPU computation is slow compared to GPU, the Adam updates are trivial compared to gradient computations, so this should not make a difference.
Is this implemented somewhere?
That’s not the case as DDP keeps the model on the used devices without triggering any copies. Moving the model back and forth in each iteration would cause a large performance regression.
You might be interested in ZeRO-Offload for your use case.
Thanks. When I read their pitch, it sounds like more or less what I am saying here. Cool, thanks.
About your comment (just for me to understand):
In DDP we need to send weights to all devices and collect gradients from all devices. Isn’t this just the same as sending the model across (assuming that dealing with the weights is far more expensive than creating all the objects)? However, it may be that transfer CPU ←→ GPU is slower than transfer between GPUs. I am still learning about this space.
DDP will only clone the model once at the initialization from rank0 to all other ranks. Afterwards, each rank will use its own clone of the model and only the gradients will be synchronized.
OK, but that is still one reduce of gradients to rank 0, followed by a broadcast of the gradients back to all ranks. I suppose what you are saying is that this is faster than a reduce of gradients to rank 0, an optimizer step there, and a broadcast of the new weights back. I am just not so much on top of the hardware details.
Thanks, I learned something!
No, DDP does not reduce the gradients to rank 0 and then broadcast them. It uses an all-reduce, so every rank ends up with the averaged gradients and no separate broadcast is needed!
Since we have reduced the gradients, we can also update the parameters and broadcast the updated parameters instead, right? That setup would be a parameter server architecture, which is conceptually different from DDP.
In DDP:
Suppose we have four GPUs (ranks 0–3) on a single node. Each rank computes gradients on its local mini-batch. Conceptually, each rank then sends its gradients to all other ranks and receives theirs (DDP implements this exchange as an all-reduce). For example, rank 2 receives the gradients from ranks 0, 1, and 3, sums them together with its own, and divides the sum by 4. Every rank therefore ends up with the same averaged gradients, so there is no need to broadcast the reduced gradients afterwards.
Once the averaging is complete, every rank updates its own copy of the model parameters. As a result, the model weights remain identical across all ranks. This process repeats for each training batch.
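The difference between the two schemes can be simulated in a few lines of plain Python. This is only a sketch: the four "ranks" are lists in one process, `all_reduce_mean` and `sgd_step` are hypothetical helpers rather than `torch.distributed` calls, the gradient values are made up, and plain SGD stands in for Adam to keep it short.

```python
def all_reduce_mean(per_rank_grads):
    """Simulate an all-reduce: every rank receives the element-wise mean."""
    world_size = len(per_rank_grads)
    mean = [sum(vals) / world_size for vals in zip(*per_rank_grads)]
    return [list(mean) for _ in range(world_size)]


def sgd_step(weights, grads, lr=0.1):
    """Plain SGD update, returning new weights."""
    return [w - lr * g for w, g in zip(weights, grads)]


# All ranks start from identical weights (DDP copies them once at initialization).
init = [1.0, -2.0, 0.5]
replicas = [list(init) for _ in range(4)]

# Made-up local gradients, one list per rank, from different mini-batches.
local_grads = [
    [0.4, 0.0, -0.2],
    [0.2, 0.4, -0.6],
    [0.0, 0.8, 0.2],
    [0.6, -0.4, 0.2],
]

# DDP style: all-reduce the gradients, then every rank runs its own optimizer step.
synced = all_reduce_mean(local_grads)
replicas = [sgd_step(w, g) for w, g in zip(replicas, synced)]

# Parameter-server style: reduce in one place, update there, broadcast new weights.
mean_grad = [sum(vals) / 4 for vals in zip(*local_grads)]
server_weights = sgd_step(init, mean_grad)
ps_replicas = [list(server_weights) for _ in range(4)]

print(all(r == replicas[0] for r in replicas))  # replicas stay in sync
print(replicas == ps_replicas)                  # both schemes agree on the weights
```

Both schemes arrive at the same weights. What differs is the communication pattern: one gradient all-reduce per step in the DDP style, versus a reduce followed by a weight broadcast in the parameter-server style.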
Hope it clarifies!