High CPU usage normal?

Hi there,

I have a question regarding CPU usage when training with Opacus. I noticed that when training with Opacus the CPU usage explodes compared to non-private training, and I am wondering whether this is expected or some kind of bug.

For example, when I train the simple MNIST example from the GitHub repo, my CPU usage spikes to 6000% (EPYC 7702P). Disabling DP training cuts CPU usage down to 100%. I observe similar behavior when training CIFAR10 with a small NN. The GPU is definitely being utilized, though: data is transferred to it and GPU utilization increases.

It seems to be connected to the data loader: if I use the non-private data loader instead of the “private” data loader returned by the PrivacyEngine, this behavior does not appear. I understand that the private data loader does Poisson sampling and that the privacy accounting also happens (presumably on the CPU), but does this really consume that much CPU? Am I missing something?
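
For context, this is my mental model of what the Poisson sampling step does (a rough conceptual sketch only, not Opacus' actual implementation):

```python
import torch

# Conceptual sketch of Poisson subsampling (my mental model, not Opacus' actual code):
# each batch includes every example independently with probability `sample_rate`,
# so there is one Bernoulli draw per example per batch, all on the CPU.
def poisson_batches(dataset_size: int, sample_rate: float, num_batches: int):
    for _ in range(num_batches):
        mask = torch.rand(dataset_size) < sample_rate
        yield torch.nonzero(mask).flatten()

# MNIST-sized example: expected batch size ~512, but the actual size varies per batch.
for indices in poisson_batches(60_000, 512 / 60_000, 3):
    print(len(indices))
```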

Is there an overview somewhere showing which steps of DP-SGD happen on the GPU and which happen on the CPU?

Best regards and thanks in advance!

Hey FMorsbach,

Thanks for your interest and your detailed question. Higher CPU utilization is expected when subclassing the PyTorch data loader; however, I'm unable to gauge whether the numbers you report are in line with what to expect.

I would have two follow-up questions:

  • What is the utilization if you keep the private data loader but disable Poisson sampling?
  • What happens if you increase or decrease the number of workers (and what is your current number of workers)? See the sketch after this list for what I mean by these two knobs.
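
For clarity, here is roughly what I mean (a sketch only; noise_multiplier and max_grad_norm are placeholder values, and the stand-in dataset/model just make the snippet runnable):

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Stand-ins for the MNIST example's dataset, model and optimizer
train_dataset = TensorDataset(torch.randn(60_000, 28 * 28), torch.randint(0, 10, (60_000,)))
model = nn.Linear(28 * 28, 10)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Knob 1: number of data-loader workers
train_loader = DataLoader(train_dataset, batch_size=512, num_workers=1)

privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.1,    # placeholder value
    max_grad_norm=1.0,       # placeholder value
    poisson_sampling=False,  # Knob 2: keep the private loader but disable Poisson sampling
)
```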

Pierre

Hi,

thanks for your reply!

One additional behavior I noticed (not sure if it is relevant): without any modification to the MNIST example, it spawns one thread per CPU core and fully utilizes all of them. On a 64C/128T CPU (7702P) that means 6400% CPU usage with 64 threads; on a 10C/20T CPU (9900X) it means 1000% with 10 threads.
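
If it helps with reproducing: my guess is that these per-core threads come from PyTorch's intra-op thread pool, which can be inspected and capped like this:

```python
import torch

# Guess: the one-thread-per-core behavior is PyTorch's intra-op thread pool,
# which defaults to one thread per physical core.
print(torch.get_num_threads())   # matches the 64 / 10 threads mentioned above

# Cap it from inside the script...
torch.set_num_threads(1)

# ...or from the environment before starting the process:
#   OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 python your_training_script.py
```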

But here are some numbers regarding your questions:

  • Non-Private Dataloader + num_workers=0 → 6400% CPU, ~60 it/s
  • Non-Private Dataloader + num_workers=1 → 200% CPU, ~105 it/s
  • Non-Private Dataloader + num_workers=5 → 250% CPU, ~105 it/s
  • Private Dataloader + num_workers=0 + poisson_sampling=True → 6400% CPU, ~55 it/s
  • Private Dataloader + num_workers=0 + poisson_sampling=False → 6400% CPU, ~60 it/s
  • Private Dataloader + num_workers=1 + poisson_sampling=True → 6400% CPU, ~105 it/s
  • Private Dataloader + num_workers=1 + poisson_sampling=False → 200% CPU, ~105 it/s
  • Private Dataloader + num_workers=5 + poisson_sampling=True → 6400% CPU, ~105 it/s
  • Private Dataloader + num_workers=5 + poisson_sampling=False → 250% CPU, ~105 it/s

From this, it seems like the Poisson sampling is actually the major CPU hog here. For further analysis I used 1 worker and Poisson sampling as a baseline, but limited the process to 1 CPU thread, and still achieved ~75 it/s (so ~75% of the performance). So there apparently is some benefit in allocating more CPU threads: with 8 CPU threads it already reached ~95-100 it/s. To me it looks like it does scale with more threads, but not very well.
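
For reference, a simplified sketch of the kind of measurement I mean (a dummy dataset stands in for MNIST and only the data-loading side is timed here; the numbers above come from the full training loop):

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for the MNIST data; only data loading is timed in this sketch.
dataset = TensorDataset(torch.randn(60_000, 1, 28, 28), torch.randint(0, 10, (60_000,)))
loader = DataLoader(dataset, batch_size=512, num_workers=1)

def measure_its(loader, steps: int = 100) -> float:
    """Rough iterations per second over a fixed number of data-loading steps."""
    it = iter(loader)
    start = time.perf_counter()
    for _ in range(steps):
        next(it)  # sampling + collation; forward/backward/step would follow here
    return steps / (time.perf_counter() - start)

for n_threads in (1, 8, 64):
    torch.set_num_threads(n_threads)  # re-running the script per setting is more reliable
    print(f"{n_threads} threads: {measure_its(loader):.1f} it/s")
```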

If I do not limit the CPU threads per process and start 4 trainings at the same time (so spawning 64*4 threads), each with a dedicated 3090, the throughput drops to 35-40 it/s for each training. So there is also a problem when scaling to more than two GPUs per system (maybe per socket? I don't have a dual-socket system at hand for testing).

In conclusion, I am still wondering how the number of threads used for the Poisson sampling is determined, whether this much CPU is actually necessary, and whether it could be optimized.

Best regards,
Felix

Hey Felix,

Thanks a ton for all the detailed measurements. From your analysis it clearly appears that Poisson sampling is the bottleneck, regardless of the number of workers.

After some digging, I think this may come from our implementation of UniformWithReplacementSampler. With the secure_rng option on it would indeed be slow; did you activate this flag? Otherwise, the overhead still seems high to me.
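
For reference, the flag is set when constructing the engine; treat this as a sketch, since depending on the Opacus version it is spelled secure_mode or secure_rng:

```python
from opacus import PrivacyEngine

# Default: fast, non-cryptographic RNG for sampling and noise generation
privacy_engine = PrivacyEngine(secure_mode=False)

# Cryptographically secure RNG (relies on torchcsprng), noticeably slower;
# this is the flag I was asking about. In some versions it is spelled secure_rng.
# privacy_engine = PrivacyEngine(secure_mode=True)
```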

Regarding scaling to more GPUs/threads, did you try out DistributedUniformWithReplacementSampler?

Hope this helps,
Pierre
