Dataloader not scaling with number of workers

Hi there,

I am experiencing a performance issue with my dataloader that I’d appreciate some help with. The crux of the issue is the following timing loop:

from timeit import default_timer as timer

for i, data_to_feed_model in enumerate(dataloader):
    end = timer()
    if i > 0:
        # Time elapsed since the end of the previous iteration, i.e. how long
        # the training loop blocked waiting on the dataloader for this batch.
        print(f"Time spent waiting for batch from dataloader = {end - start}")

    ## Perform model forward, backward with data_to_feed_model ##

    start = timer()

The collate_fn of my dataloader has a bottleneck function that is basically doing the following:

# The hot loop: N independent torch.multinomial draws per batch.
for i in range(N):
    samples = torch.multinomial(...)

I have also implemented this critical section in C++ and exposed it as a pybind11 module; timing the critical section on its own, the C++ version gives me about a 4x speedup for my problem size.
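
For reference, here is a stripped-down sketch of the kind of collate_fn I mean; the weights and num_samples below are placeholders rather than my real values:

import torch

def my_collate_fn(batch):
    # Placeholder bottleneck: one multinomial draw per item in the batch.
    sampled = []
    for weights in batch:  # weights: 1-D tensor of (unnormalized) probabilities
        samples = torch.multinomial(weights, num_samples=16, replacement=True)
        sampled.append(samples)
    return torch.stack(sampled)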

However, when timing how long the main training loop spends waiting for a batch from the dataloader, I notice the following:

  1. For the C++ implementation: when using multiple dataloader workers, the time the training loop spends waiting for a batch scales linearly up to 4 workers, i.e. there is a linear decrease in the waiting time when going from 1 to 4 workers. With 8 workers, however, there is no further decrease; waiting for a batch takes just as long (if not longer) as it did in the 4-worker case. However, if I comment out my entire training loop and only pull batches out of the dataloader without ever using them in any actual computation, like so:
for i, data_to_feed_model in enumerate(dataloader):
    end = timer()
    if i > 0:
        print(f"Time spent waiting for batch from dataloader = {end - start}")

    # No model forward/backward here -- batches are fetched and discarded.
    start = timer()

It continues to scale linearly up to 64 workers.

  2. For the Python implementation: the same issue still exists, but, peculiarly, the threshold beyond which it stops scaling is reached at 16 workers (whereas with the C++ implementation of the bottleneck it was reached at 4 workers). And once again, if there is no actual computation in the training loop, it continues to scale up to 128 workers.

I would really appreciate some help trying to understand what’s going on here.

Thank you very much for taking the time to read this, and for any pointers you can give me. If you have any clarifying questions, I’d be more than happy to answer them.

How many physical CPU cores are available on the system you are benchmarking on? In many cases numerical preprocessing won’t scale beyond the number of physical CPU cores.

Additionally, is there any device-side work or are there host-to-device transfers occurring? If so, you might need to add synchronize calls to make sure the timing is measured correctly.
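
For example, here is a rough sketch of what I mean, adapting your timing loop; this assumes CUDA and is only meant to illustrate where the synchronize calls would go:

from timeit import default_timer as timer
import torch

for i, data_to_feed_model in enumerate(dataloader):
    torch.cuda.synchronize()  # wait for any queued GPU work before reading the clock
    end = timer()
    if i > 0:
        print(f"Time spent waiting for batch from dataloader = {end - start}")

    ## Perform model forward, backward with data_to_feed_model ##

    torch.cuda.synchronize()  # ensure this step's GPU work has finished
    start = timer()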

Hi there, thanks for getting back to me:

  1. Here’s a snippet from lscpu on my benchmarking machine:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 52 bits physical, 57 bits virtual
CPU(s): 160
On-line CPU(s) list: 0-159
Thread(s) per core: 2
Core(s) per socket: 40
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 106
Model name: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz

So it looks like I have plenty of CPU cores (2 sockets x 40 cores = 80 physical cores, 160 logical).

  2. I do not have any async host-to-device transfers happening in my code; they are all blocking calls (non_blocking=False).

Interesting. Could it be that you are already saturating the device bandwidth with four workers?

I would try to check, via some napkin math, how many samples/batches are copied to device memory per second and see whether the bytes/second you achieve there makes sense.
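
Roughly along these lines; all of the numbers below are hypothetical placeholders you would replace with your own batch shape and measured throughput:

import math

# Napkin math for achieved host-to-device bandwidth (placeholder numbers).
batch_shape = (256, 3, 224, 224)              # e.g. a batch of float32 tensors
bytes_per_batch = 4 * math.prod(batch_shape)  # 4 bytes per float32 element

batches_per_second = 20                       # measured with the timing loop above
achieved_gb_per_s = bytes_per_batch * batches_per_second / 1e9
print(f"~{achieved_gb_per_s:.2f} GB/s copied host -> device")
# Compare this against what the PCIe link can realistically sustain.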

I’m confused about why host-to-device transfer bandwidth would matter here, or, more specifically, why employing more dataloader workers would put more pressure on that bandwidth. It’s still only copying one batch’s worth of data from host to device in each iteration, right?

If it matters, I am using an H100 GPU.

Thanks again for your responses!

I am also noticing a few more things:

  1. When monitoring the CPU% of each process, the CPU% of the main process increases when I add more dataloader workers, and is always more than 100% for num_workers >= 8 (over 300% for num_workers=16). Is this expected behavior?

  2. The CPU% of each of the dataloader worker processes is < 50%, regardless of the number of workers.

  3. Up to 8 workers, restricting the number of logical cores visible to the worker processes (with taskset) so that only num_workers logical cores can be used seems to improve performance by about 10% (see the affinity sketch right after this list).
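
In case it is useful, the same kind of per-worker pinning can also be done in-process via a worker_init_fn instead of taskset; this is only a rough sketch (Linux-only, and the dataset/batch_size/collate_fn names here are placeholders):

import os
from torch.utils.data import DataLoader

def pin_worker_to_core(worker_id):
    # Restrict this worker process to a single logical core (Linux-only).
    os.sched_setaffinity(0, {worker_id})

dataloader = DataLoader(dataset,
                        batch_size=batch_size,
                        num_workers=8,
                        collate_fn=my_collate_fn,
                        worker_init_fn=pin_worker_to_core)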

I’m not sure whether any of this helps in understanding what’s happening, but I thought I’d post it anyway.

Any pointers are greatly appreciated!