DPOptimizer: filter parameters to add noise to

Hello,
I modified the Opacus source code to create a variant of DPOptimizer that adds noise only to specified parameter groups of the underlying optimizer. I did this because I kept running into errors like this one:

self = SGD (
Parameter Group 0
    dampening: 0
    foreach: None
    initial_lr: 0.0001
    lr: 0.0001
    maximize: False
    momentum: 0.9
    nesterov: False
    weight_decay: 0.0
)

    def clip_and_accumulate(self):
        """
        Performs gradient clipping.
        Stores clipped and aggregated gradients into `p.summed_grad```
        """
    
        per_param_norms = [
            g.reshape(len(g), -1).norm(2, dim=-1) for g in self.grad_samples
        ]
>       per_sample_norms = torch.stack(per_param_norms, dim=1).norm(2, dim=1)
E       RuntimeError: stack expects each tensor to be equal size, but got [300] at entry 0 and [50] at entry 12

I realized I could solve this if I could “skip” DP for some layers, but I could not split the single optimizer into two and wrap only one of them in DPOptimizer: that would turn the training loop into a multi-optimizer one, and I’ve struggled to integrate Opacus with PyTorch Lightning for multi-optimizer LightningModules. So here’s what I did:

__all__ = ["DPOptimizer"]

import typing as ty
from opacus.optimizers import DPOptimizer as OpacusDPOptimizer
import torch

# from opacus.optimizers.optimizer import _check_processed_flag, _mark_as_processed
# from opt_einsum.contract import contract


def params(
    optimizer: torch.optim.Optimizer,
    accepted_names: ty.Optional[ty.List[str]] = None,
) -> ty.List[torch.nn.Parameter]:
    """
    Return all parameters controlled by the optimizer
    Args:
        optimizer (torch.optim.Optimizer):
            Current optimizer.
    Returns:
        (ty.List[torch.nn.Parameter]): Flat list of parameters from all `param_groups`
    """
    if accepted_names is not None:
        accepted_names = [name.lower() for name in accepted_names]
    ret = []
    for param_group in optimizer.param_groups:
        if accepted_names is not None and "name" in param_group.keys():
            if param_group["name"].lower() in accepted_names:
                ret += [p for p in param_group["params"] if p.requires_grad]
        else:
            ret += [p for p in param_group["params"] if p.requires_grad]
    return ret


class DPOptimizer(OpacusDPOptimizer):
    """Brainiac-2's DP-Optimizer"""

    def __init__(
        self,
        *args: ty.Any,
        param_group_names: ty.Optional[ty.List[str]] = None,
        **kwargs: ty.Any,
    ) -> None:
        """Constructor."""
        self.param_group_names = param_group_names
        super().__init__(*args, **kwargs)

    @property
    def params(self) -> ty.List[torch.nn.Parameter]:
        """
        Returns a flat list of ``nn.Parameter`` managed by the optimizer,
        restricted to the selected parameter groups.
        """
        return params(self, self.param_group_names)

    def clip_and_accumulate(self) -> ty.Any:
        """Performs gradient clipping. Stores clipped and aggregated gradients into `p.summed_grad`."""
        return super().clip_and_accumulate()
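
For reference, here’s a minimal usage sketch showing how I attach the `name` keys that the filter above matches on (the model and group names are just placeholders; in real training the model is still wrapped via PrivacyEngine / GradSampleModule so that `p.grad_sample` gets populated):

import torch

# Placeholder model: two "blocks" we want to treat differently.
model = torch.nn.Sequential(torch.nn.Linear(10, 10), torch.nn.Linear(10, 2))

# Named parameter groups on the underlying optimizer.
base_optimizer = torch.optim.SGD(
    [
        {"params": model[0].parameters(), "name": "private"},
        {"params": model[1].parameters(), "name": "public"},
    ],
    lr=1e-4,
    momentum=0.9,
)

# Only the "private" group gets clipping and noise.
dp_optimizer = DPOptimizer(
    base_optimizer,
    noise_multiplier=1.0,
    max_grad_norm=1.0,
    expected_batch_size=32,
    param_group_names=["private"],
)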

Now I’m wondering: am I breaking anything? Of course, the code that was raising the RuntimeError no longer raises any exception, but I’m still skeptical. Will the privacy accountant still work properly? Is this a good solution?

Thanks!

Hi @Gianmarco_Aversano
I don’t have enough context about your system to say whether these changes affect the validity of the privacy guarantee, but I’ll try to give a high-level overview of what could go wrong.

The first thing that immediately jumps out at me is the shape-mismatch exception you hit in the first place. Errors at this stage can indicate fundamental issues that invalidate the DP guarantees for the entire model.
One important requirement the model must satisfy to be viable for DP-SGD is that the batch dimension remains first throughout all layers (or second if batch_first=False). That is, for every submodule, the input tensor should be of shape [N, *], where N equals the input batch size and is the same for all layers. We need this to properly compute per-sample gradients, i.e. the impact each individual sample from the dataset has on the trained model.
The error you’ve been getting (stack expects each tensor to be equal size) may indicate that somewhere within your forward() method the batch dimension is not preserved.
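
As a toy illustration (standalone, not your model): wrapping a module in GradSampleModule populates p.grad_sample for every trainable parameter, and clip_and_accumulate expects all of them to share the same leading batch dimension:

import torch
from opacus import GradSampleModule

model = GradSampleModule(torch.nn.Linear(10, 2))
x = torch.randn(32, 10)          # batch dimension N=32 comes first
model(x).sum().backward()

for p in model.parameters():
    # Each grad_sample has shape [N, *param_shape]; if a layer reshapes
    # away the batch dimension, these leading sizes diverge and you get
    # "stack expects each tensor to be equal size".
    print(p.grad_sample.shape)   # torch.Size([32, 2, 10]) and torch.Size([32, 2])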

Additionally, you need to understand the implications of “skipping” DP for certain layers. Adding noise only to a subset of model parameters does not give you any guarantees about the model as a whole. It only really works if you freeze those layers, in which case you don’t need to add noise to them, but you also don’t update them at all (see this tutorial for more details). Note that if you do this, you don’t really need two optimizers.
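
For illustration, here’s a minimal single-optimizer sketch of the freezing approach (the model, hyperparameters and data are placeholders, not your setup):

import torch
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

model = torch.nn.Sequential(
    torch.nn.Linear(10, 10),   # layer we want to exclude from training
    torch.nn.ReLU(),
    torch.nn.Linear(10, 2),
)

# Freeze the excluded layer: no noise for it, but also no updates.
for p in model[0].parameters():
    p.requires_grad = False

# One optimizer over the trainable parameters only.
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4, momentum=0.9
)

loader = DataLoader(
    TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,))), batch_size=8
)

privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.0,
    max_grad_norm=1.0,
)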

Hello, thanks for replying!

For more context, the model I need to make DP-compliant is GRAN, which also uses GRUCell layers…

Other users have also run into this, but I can’t find that discussion at the moment.

If you know GRAN, do you know how one would apply DP to it? If not, what are the limitations of Opacus? For example, is using GRUCell a limitation, or having a number of backward passes that differs from the number of forward passes?

Please note that I’ve also already tried replacing all the GRUCell layers with their DP counterparts from Opacus (from opacus.layers.dp_rnn import DPGRUCell).

Unfortunately I’m not familiar with the specifics of GRAN. AFAIK we do support GRUCell, and recurrent networks in general are supported; by itself that should not cause the shape errors you’ve been describing in your post.

Ok, for even more context:

  • I am trying to use Opacus on a graph dataset with PyTorch Geometric. Here you do not really have batches that are Tensor objects of size (B, *)… Batches are torch_geometric.data.Data objects (or Batch/DataBatch, I don’t remember exactly; see the small sketch after this list). Can this be the problem? And if it is, why does it work when I apply DP only to some layers?
  • With GANs, as shown in the link, we apply DP only to the discriminator. With GRAN (the model I need to make DP-compliant), I was hoping to achieve the same thing.
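
To show what I mean about the batches, here’s a tiny standalone sketch with made-up graphs (not my actual dataset):

import torch
from torch_geometric.data import Data, Batch

g1 = Data(x=torch.randn(3, 8), edge_index=torch.tensor([[0, 1], [1, 2]]))
g2 = Data(x=torch.randn(5, 8), edge_index=torch.tensor([[0, 1, 2], [1, 2, 3]]))
batch = Batch.from_data_list([g1, g2])

print(batch.num_graphs)   # 2 graphs in the "batch"
print(batch.x.shape)      # torch.Size([8, 8]) -> leading dim counts nodes, not graphs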

Unfortunately, this is a weak spot of the Opacus validation process. If your model does not respect the batch-dimension convention, it is possible that Opacus won’t notice it, but the privacy guarantee will be broken.

I’m not familiar with the specifics of PyTorch Geometric - are these DataBatch objects functionally equivalent to batches?

On your second question - there’s a difference between Opacus not throwing an error and you getting a meaningful privacy guarantee at the end. If you only apply DP to a subset of the trainable weights, we assume you know what you’re doing, as we try to be flexible in how you can use Opacus. However, the way you describe it, I don’t see what meaningful guarantee (at least theoretically) you get from applying DP to only a subset of the trainable weights.