Dataloader collate function scaling

Hi there,

I have a custom dataloader that does a lot of processing (I’ll exclude the details of the processing for now), and I’m trying to figure out why it isn’t scaling with the number of workers.

  1. I timed my collate function, and when I go from 4 to 8 to 16 workers the time each worker takes to execute it stays pretty much the same. But when I go from 16 to 24 workers, the per-worker time starts to increase. I am not reading anything from disk in my dataloader; I am only reading from memory and performing computations. What could be causing this? Am I saturating my CPU's memory bandwidth, or its instruction throughput (IPS)?

  2. With 8 and 16 workers, each worker takes the same amount of time per batch, yet the time spent waiting for a batch in my main training loop does not change even though there are twice as many workers. What could explain this? Shouldn’t twice as many batches be ready in the same amount of time, meaning roughly half the waiting time per batch in the main training loop?
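
For anyone trying to reproduce the second measurement, here is a minimal sketch of how the wait time could be separated from the step time. It is not the exact code from my setup: `measure_wait_times` and `train_step` are hypothetical names, and the only assumption is that the loader is an ordinary Python iterable (which a PyTorch `DataLoader` is).

```python
import time

def measure_wait_times(loader, train_step):
    """Iterate once over `loader` and return (wait_s, step_s, n_batches).

    wait_s: total time the main loop spent blocked waiting for the next batch
    step_s: total time spent inside `train_step` (hypothetical callback)
    """
    wait_s = 0.0
    step_s = 0.0
    n_batches = 0
    it = iter(loader)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)          # blocks until a batch is available
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)             # stand-in for forward/backward/optimizer
        t2 = time.perf_counter()
        wait_s += t1 - t0
        step_s += t2 - t1
        n_batches += 1
    return wait_s, step_s, n_batches
```

Comparing `wait_s / n_batches` across worker counts shows whether extra workers shorten the wait. If the step runs on a GPU, the step time is only meaningful if the device is synchronized before `t2` is taken.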

The node I’m running on has 96 logical CPUs (48 physical cores) and 500 GB of DRAM, so I am nowhere near the limit on core count or memory capacity.
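
One thing worth ruling out is whether the process can actually use all of those CPUs. The sketch below is only an assumption about what might be relevant here: it compares the machine's logical CPU count with the number of CPUs in the process's affinity mask (which can be smaller if the job was launched under `taskset` or a scheduler that pins cores). `usable_cpu_count` is a hypothetical helper name; `os.sched_getaffinity` is only available on some platforms such as Linux, hence the fallback.

```python
import os

def usable_cpu_count():
    """Number of logical CPUs this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):
        # Affinity mask of the current process (pid 0 means "this process").
        return len(os.sched_getaffinity(0))
    # Fallback where affinity is not exposed: total logical CPUs.
    return os.cpu_count() or 1

print("usable CPUs:", usable_cpu_count(), "of", os.cpu_count())
```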