The Opacus example: train batch size vs. sampling rate

Hi Opacus team,
I am running a test using this example project:

One specific thing I noticed is that the train batch size is defined like this:

batch_size=int(args.sample_rate * len(train_dataset))

So the train dataloader's batch size depends on the sampling rate.
For this CIFAR-10 example code, the sample rate is 0.04, so the train batch size comes out to 2000, which is much larger than usual. I therefore tried decreasing the sample rate to something like 0.004 (so the train batch size becomes 200).
But in that case the training does not converge at all (after 5 epochs, the accuracy stays around 10%).
Also, I expected this sampling rate parameter to be used by a data loader wrapper (like DPDataLoader), but that does not happen in the example; the sampling rate is only used in the piece of code above to compute the train batch size.
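For reference, here is a small sketch of the relationship as I understand it (the dataset stand-in and the variable names are just for illustration, not the example's actual code):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the CIFAR-10 train set (50,000 samples).
train_dataset = TensorDataset(torch.zeros(50_000, 1), torch.zeros(50_000))

sample_rate = 0.04
batch_size = int(sample_rate * len(train_dataset))   # 0.04 * 50,000 = 2000
train_loader = DataLoader(train_dataset, batch_size=batch_size)

# The same ratio seen from the other side:
print(batch_size / len(train_dataset))   # 0.04
print(1 / len(train_loader))             # 0.04 (one over the number of batches per epoch)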
My questions:

  1. What is the logic behind having the sampling rate decide the train batch size?
  2. A batch size like 2000 is too big to be practical, but how can I get the training to converge with a smaller batch size?
  3. I am not sure whether other parameters are also relevant; so far I am using the default parameters provided by this example project.

I found something relevant in the FAQ, but I am still confused:

Assuming that batches are randomly selected, an increase in the batch size increases the sampling rate, which in turn increases the privacy budget. This effect can be counterbalanced by choosing a larger learning rate (since per-batch gradients approximate the true gradient of the model better) and aborting the training earlier.
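To check my reading of this, I tried to reproduce the effect with the accountant directly (a rough sketch assuming Opacus >= 1.0; the noise multiplier, delta, and step count are placeholder values I picked):

from opacus.accountants import RDPAccountant

# With the noise multiplier and the number of steps held fixed, a larger
# sample rate (i.e. a larger batch size on the same dataset) gives a larger epsilon.
for sample_rate in (0.004, 0.04):            # batch sizes 200 and 2000 on 50,000 samples
    accountant = RDPAccountant()
    for _ in range(1000):                    # same number of optimizer steps
        accountant.step(noise_multiplier=1.0, sample_rate=sample_rate)
    print(sample_rate, accountant.get_epsilon(delta=1e-5))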

Usually I cannot use a random batch size; I need a fixed value (which I assume is more hardware friendly).

  1. If I want a predefined batch_size, what is the appropriate sampling rate, and then the privacy budget?
  2. Furthermore, if I also have a predefined privacy budget, what is the correct way to get the right sampling rate? (My current guess is sketched after this list.)
  3. Finally, how do I find a good learning rate (or an LR scheduler)?
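To make question 2 concrete, here is my current guess (a sketch assuming Opacus >= 1.0; the batch size, epochs, epsilon, and delta are just example values I picked):

from opacus.accountants.utils import get_noise_multiplier

dataset_size = 50_000                      # CIFAR-10 train set
batch_size = 256                           # the fixed, hardware-friendly value I want
epochs = 20
target_epsilon, target_delta = 8.0, 1e-5

# The sample rate is fully determined by the batch size and the dataset size.
sample_rate = batch_size / dataset_size

# For a fixed privacy budget, it is the noise multiplier (not the sample rate)
# that gets adjusted.
sigma = get_noise_multiplier(
    target_epsilon=target_epsilon,
    target_delta=target_delta,
    sample_rate=sample_rate,
    epochs=epochs,
)
print(sample_rate, sigma)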

Hello,

Thanks for your question.

This post made me aware that our CIFAR-10 example needs to be updated to be similar to our other examples, where the sample rate is not defined as an input. We do not require the sample rate to be an explicit argument; we prefer to infer it from the data loader. For example, see here (“It’s automatically inferred from the data loader”):

So I will go ahead and update the CIFAR-10 example to be similar to the other examples, where batch_size is directly provided as an argument. Also, the main idea behind sample_rate * len(train_dataset) was this.

Regarding the other parameters and convergence: yes, they are relevant (the learning rate, for example).

And regarding your second post: yes, you will be able to use a predefined batch size after I send the fix for the CIFAR-10 example.

Regarding your questions:

If I want a predefined batch_size, what is the appropriate sampling rate, and then the privacy budget?
Furthermore, if I also have a predefined privacy budget, what is the correct way to get the right sampling rate?

You basically do not need the sampling rate any more, only the batch size (see here).
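To illustrate, the batch-size-first flow looks roughly like this (a sketch, not the exact code of the updated example; the toy dataset, model, and hyperparameter values are placeholders):

import torch
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy stand-ins for the real dataset and model.
train_dataset = TensorDataset(torch.randn(1024, 3 * 32 * 32), torch.randint(0, 10, (1024,)))
model = torch.nn.Linear(3 * 32 * 32, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# You choose batch_size directly, as with any PyTorch data loader.
train_loader = DataLoader(train_dataset, batch_size=256)

privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.0,
    max_grad_norm=1.0,
)
# The sample rate is inferred internally from the data loader; there is no
# sample_rate argument anywhere in user code.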

Thanks, this more or less clears up my questions. Replacing the explicit sample rate with batch_size (so the rate becomes batch_size / len(dataset)) is the right direction (more friendly to PyTorch users).
The only remaining problem for me is that when I use a smaller batch size (like 64 rather than the 2000 in this example code), I get much poorer convergence. I am not sure which other parameters need to be tuned.

Hi @Leonmac
The batch size is one of the key hyperparameters for DP training, and as a rule of thumb, the larger the batch size, the better the convergence. A few papers have explored this; one example: https://arxiv.org/pdf/2110.05679.pdf

Large batch sizes might be problematic for two reasons.
If the problem is fitting into memory, I suggest taking a look at BatchMemoryManager: it allows training with large batches with a limited memory footprint (sketched below).
The other potential problem is simply slower training, and there is not much you can do about that. On average, given a limited compute budget, it is still beneficial to use a larger batch size to minimize the impact of the noise.
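Here is roughly how BatchMemoryManager is used (a sketch with placeholder values; the toy dataset and model are just stand-ins):

import torch
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine
from opacus.utils.batch_memory_manager import BatchMemoryManager

train_dataset = TensorDataset(torch.randn(4096, 32), torch.randint(0, 10, (4096,)))
model = torch.nn.Linear(32, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

# Large "logical" batch size used for DP accounting and gradient averaging ...
train_loader = DataLoader(train_dataset, batch_size=2048)

privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.0,
    max_grad_norm=1.0,
)

# ... but only up to 256 samples are materialized at once.
with BatchMemoryManager(
    data_loader=train_loader,
    max_physical_batch_size=256,
    optimizer=optimizer,
) as memory_safe_loader:
    for x, y in memory_safe_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()   # takes a real step only once a full logical batch has accumulated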