I was wondering how (or if) it’s possible to use Opacus together with PyTorch FSDP (or DeepSpeed) to allow fine-tuning of a large LM that doesn’t fit on a single GPU.
Right now what I managed to do is basically have each GPU compute a per-sample gradient, clip it, and then accumulate the clipped gradients across the different processes so that I can then add the noise.
The problem is that this is very slow, and I was wondering whether there is a better way to do it?
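In case it helps, here is a rough sketch of what my current step looks like (heavily simplified, with placeholder names, and ignoring the FSDP sharding details):

```python
import torch
import torch.distributed as dist

def dp_step(model, optimizer, samples, targets, loss_fn,
            max_grad_norm, noise_multiplier):
    # placeholder function: clip each sample's gradient locally, sum the
    # clipped gradients across processes, then add noise and step
    accumulated = [torch.zeros_like(p) for p in model.parameters()]

    for x, y in zip(samples, targets):                 # one backward per sample
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()

        grads = [p.grad if p.grad is not None else torch.zeros_like(p)
                 for p in model.parameters()]
        total_norm = torch.norm(torch.stack([g.norm(2) for g in grads]))
        coef = (max_grad_norm / (total_norm + 1e-6)).clamp(max=1.0)
        for acc, g in zip(accumulated, grads):
            acc += g * coef                            # accumulate the clipped gradient

    # sum the clipped gradients over all processes
    for acc in accumulated:
        dist.all_reduce(acc, op=dist.ReduceOp.SUM)

    # add noise after the cross-process sum - this is the step I'd like to handle better
    total_batch = len(samples) * dist.get_world_size()
    for p, acc in zip(model.parameters(), accumulated):
        noise = torch.randn_like(acc) * noise_multiplier * max_grad_norm
        p.grad = (acc + noise) / total_batch

    optimizer.step()
```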
Good question - supporting FSDP could be useful for larger models; however, there are certain challenges.
The biggest problem is that in order to properly clip a gradient you need to know the overall norm of the entire gradient, which in the case of FSDP is spread across multiple machines. It’s not 100% clear from your post whether you do this, but after you compute the per-sample gradient on each GPU you need to broadcast the norm to get the clipping coefficient.
This naturally adds one more point of synchronization between GPUs, which inevitably impacts performance. I don’t have a good intuition on how much it should impact the training speed - what exactly do you mean by “very slow” here?
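To make that extra synchronization point concrete, here’s a minimal sketch of how the clipping coefficient could be computed when each worker only holds a shard of a sample’s gradient (a made-up function for illustration, not an Opacus API):

```python
import torch
import torch.distributed as dist

def global_clip_coef(local_grad_shards, max_grad_norm):
    # squared norm of the locally held shard of the per-sample gradient
    local_sq_norm = torch.stack([g.pow(2).sum() for g in local_grad_shards]).sum()
    # sum the squared norms over all workers so everyone sees the full norm
    dist.all_reduce(local_sq_norm, op=dist.ReduceOp.SUM)
    total_norm = local_sq_norm.sqrt()
    # standard flat-clipping coefficient
    return (max_grad_norm / (total_norm + 1e-6)).clamp(max=1.0)
```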
One more thing - from your description it looks like you add noise post-synchronization. You don’t need to do that; you can add the noise on one of the workers, the same way we do for DDP.
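Something along these lines - just a sketch of the idea, mirroring what we do for DDP rather than actual Opacus code: only rank 0 adds noise before the reduction, so the noise ends up in every worker’s gradient exactly once.

```python
import torch
import torch.distributed as dist

def add_noise_and_sync(clipped_grad_sum, noise_multiplier, max_grad_norm):
    # clipped_grad_sum: this worker's sum of clipped per-sample gradients
    if dist.get_rank() == 0:
        # noise is sampled on a single worker only...
        clipped_grad_sum += torch.randn_like(clipped_grad_sum) * (
            noise_multiplier * max_grad_norm
        )
    # ...and the all-reduce carries it to everyone along with the gradients
    dist.all_reduce(clipped_grad_sum, op=dist.ReduceOp.SUM)
    return clipped_grad_sum
```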
The way I clip the per-sample gradient is by using the no_sync context manager when calling loss.backward(), so that on each GPU I only have the gradient of a single sample (I think). I then clip it and add it to a local variable that basically accumulates the clipped gradients until the batch is over.
Then my real problem, though I guess it’s more of an FSDP question (I have just asked about it here in case you feel like taking a look ahah), is how I can set the noisy gradient back into the model parameters’ .grad attribute so that I can take the optimization step.
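Concretely, the last step I’m trying is something like this (I’m not sure whether assigning .grad directly is valid with FSDP’s flattened/sharded parameters):

```python
import torch

def apply_noisy_grads(model, optimizer, noisy_grads):
    # write the noisy, accumulated gradients back into .grad and take a step
    for p, g in zip(model.parameters(), noisy_grads):
        p.grad = g
    optimizer.step()
```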
Ok, based on what I’ve read about FSDP and no_sync, what you’re doing is broadly right (bear in mind, though, that I’ve never used FSDP myself, so I might be wrong about this).
The part that confuses me a little is this:
so that on each GPU I only have the gradient of a single sample (I think). I then clip it and add it to a local variable that basically accumulates the clipped gradients until the batch is over.
Without getting too deep into the FSDP context, it sounds like you’re computing per-sample gradients one by one? If that’s the case, that’s a good explanation for why your code is slow - the thing that makes Opacus fast is vectorized per-sample gradient computation.
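For reference, this is roughly what the vectorized version looks like with GradSampleModule on a toy model - a single backward pass gives you the per-sample gradients for the whole batch:

```python
import torch
import torch.nn.functional as F
from opacus import GradSampleModule

model = GradSampleModule(torch.nn.Linear(16, 2))   # wraps the model with per-sample hooks

x = torch.randn(32, 16)                            # batch of 32 samples
y = torch.randint(0, 2, (32,))
F.cross_entropy(model(x), y).backward()            # one backward for the whole batch

for p in model.parameters():
    # p.grad_sample has shape (batch_size, *p.shape): one gradient per sample
    print(p.grad_sample.shape)
```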