Using Opacus in Federated Learning -- sample_level privacy vs user_level privacy

@karthikprasad
I'm opening a new post to track this question.
Below are my original question and your reply:

Do I NEED to call make_private each time I get the updated aggregated model from the server?

I am not familiar with Flower, but I suspect you might need to make some tweaks. Opacus’s make_private creates a model that takes care of per-sample gradient computation and noise addition during SGD. In the FL setting, you don’t need per-sample gradients to preserve user-level privacy (as discussed in [1710.06963] Learning Differentially Private Recurrent Language Models). Rather, each client update is itself a gradient, so you can simply aggregate them all, add noise, and update the central model.
A sample implementation of this can be found at FLSim/sync_dp_servers.py at main · facebookresearch/FLSim · GitHub. Note that this doesn’t use Opacus to add noise or to wrap the model, but still achieves DP.
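
To make the aggregation step concrete, here is a minimal sketch in plain Python of server-side, user-level DP averaging: clip each client update, sum, add Gaussian noise once, then average. This is not the FLSim or Opacus code. The names `clip_update`, `dp_aggregate`, `server_step`, `clip_norm`, and `noise_multiplier` are hypothetical, and the sketch assumes each client update is a flat list of floats.

```python
import math
import random


def clip_update(update, clip_norm):
    """Scale one client's update so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def dp_aggregate(client_updates, clip_norm, noise_multiplier, rng=None):
    """Average clipped client updates, adding Gaussian noise once on the server."""
    if not client_updates:
        raise ValueError("need at least one client update")
    rng = rng or random.Random()
    clipped = [clip_update(u, clip_norm) for u in client_updates]
    dim = len(clipped[0])
    summed = [sum(u[i] for u in clipped) for i in range(dim)]
    # One client can change the sum by at most clip_norm (in L2 norm),
    # so the noise standard deviation is scaled by clip_norm.
    sigma = noise_multiplier * clip_norm
    noisy_sum = [s + rng.gauss(0.0, sigma) for s in summed]
    return [s / len(client_updates) for s in noisy_sum]


def server_step(global_weights, client_updates, clip_norm=1.0,
                noise_multiplier=1.0, rng=None):
    """Apply the noisy average update to the central model's weights."""
    delta = dp_aggregate(client_updates, clip_norm, noise_multiplier, rng)
    return [w + d for w, d in zip(global_weights, delta)]
```

Note that the unit being clipped here is a whole client update, not an individual sample's gradient, which is why no per-sample gradient computation is needed on the clients. In this sketch the clipping is done inside `dp_aggregate` for brevity; it could equally run on each client before the update is sent.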

I found another post that mentioned user-level privacy vs. sample-level privacy. Is there a specific explanation of the difference? My understanding is that sample-level protection protects each data sample in the training data-set. What is user-level privacy then?

It sounds like what you mean is that if I am doing FL, I don’t need make_private on the client side (since it provides sample-level protection), and I can simply add the noise on the server side (“Rather, each client update is itself a gradient, so you can simply aggregate them all, add noise, and update the central model.”).

Fundamentally, in an FL system, what the user really wants to protect is still the individual data (for an image classification problem, the protection target is each image that the client contributes for learning). Is that the right understanding?

Hi @Leonmac ,

My understanding is that sample-level protection protects each data sample in the training data-set.

That is correct.

What is user-level privacy then?

Fundamentally, in an FL system, what the user really wants to protect is still the individual data (for an image classification problem, the protection target is each image that the client contributes for learning). Is that the right understanding?

Let me answer both questions together here. User-level DP ensures that the probability distribution on the published results of an analysis is “essentially the same,” independent of whether any single client opts in to, or opts out of, the data set. In the context of Federated Learning, it ensures that an attacker with access to intermediate model states cannot conclude with confidence whether a particular client participated in training. This also means that user-level DP provides stronger privacy protection.
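
For reference, the “essentially the same” condition above is usually written as the standard (ε, δ)-DP inequality. A randomized mechanism M is (ε, δ)-differentially private if, for every pair of adjacent data sets D and D′ and every set S of possible outputs:

$$
\Pr[M(D) \in S] \le e^{\varepsilon} \, \Pr[M(D') \in S] + \delta
$$

The two notions discussed here differ only in what “adjacent” means. In sample-level DP, D and D′ differ in a single training example. In user-level DP, D and D′ differ in all the examples contributed by a single client. Since a client may hold many examples, the user-level guarantee covers the sample-level one for the same ε and δ.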

The choice between sample-level and user-level DP in the FL setting depends on whether a user is only protective of their data or also protective of their participation. The threat model is also slightly different: in sample-level DP, the clipping and noising of gradients happens on the device. In user-level DP, clipping can happen on the user’s device but noise addition happens on the server after aggregation, which requires the users to trust the server.
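
For contrast with server-side noising, here is a minimal sketch of the on-device, sample-level variant: each example's gradient is clipped and the noise is added locally before anything leaves the device. This is roughly the kind of step that Opacus’s make_private automates for a PyTorch model, but the code below is not Opacus code. The function `dp_sgd_step` and its parameters are hypothetical, and the per-sample gradients are assumed to be given as flat lists of floats.

```python
import math
import random


def dp_sgd_step(weights, per_sample_grads, lr, clip_norm,
                noise_multiplier, rng=None):
    """One local DP-SGD step: clip each sample's gradient, add noise, update."""
    if not per_sample_grads:
        raise ValueError("need at least one per-sample gradient")
    rng = rng or random.Random()
    summed = [0.0] * len(weights)
    for grad in per_sample_grads:
        # Clip per sample, so one example can move the sum by at most clip_norm.
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            summed[i] += g * scale
    # Noise is added on the device, so the server never sees un-noised gradients.
    sigma = noise_multiplier * clip_norm
    batch_size = len(per_sample_grads)
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / batch_size for s in summed]
    return [w - lr * g for w, g in zip(weights, noisy_mean)]
```

The structure is the same clip-then-noise pattern in both settings. What changes is the unit being clipped (one sample versus one client's whole update) and where the noise is generated (device versus server).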


@karthikprasad Thanks for the reply. So a simple conclusion (maybe too simple?) would be:

  1. If a user wants to protect “the individual data”, then sample-level DP is needed, and DP is implemented on the participant side.

  2. If a user wants to protect “the participation” (i.e. nobody can deduce whether this user participated and contributed data; is this the correct understanding?), then user-level DP is needed, and DP is implemented on the server side (noise addition happens on the server after aggregation). In this case, the user must trust the server.

  3. A user may need both 1 and 2, in which case DP protection needs to be implemented on both the client and the server?

Please correct me if anything is not right. Thanks a lot.

Hi @Leonmac

I believe you are right.

  1. Yes
  2. Yes
  3. Yes, in some settings we may have both sample-level and client-level DP, though user-level DP is typically the stronger guarantee. See: Learning with User-Level Privacy | OpenReview