Higher CPU usage with DataLoader num_workers=0

I was training AlexNet on the ImageNet dataset and varied the num_workers argument of the DataLoader to see what impact it had. When analyzing the CPU usage, I found that it is higher with num_workers=0 than with num_workers set to 2, 4, or 8 (the results are shown in the graph). I know the difference is not large, but is there a reason for this to happen?
I don’t know if it’s important, but the experiments were run with pin_memory=True.

Note: The results shown in the graph are the average of 5 runs
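
In case it helps, the loaders I compared were configured roughly like this (a minimal sketch; the dataset object and the batch size are placeholders, not the exact values from my runs):

from torch.utils.data import DataLoader

# Hypothetical configuration: only num_workers varies between runs.
for num_workers in [0, 2, 4, 8]:
    loader = DataLoader(
        train_dataset,          # placeholder for the pre-processed ImageNet set
        batch_size=256,         # placeholder value
        shuffle=True,
        num_workers=num_workers,
        pin_memory=True,        # as in the experiments
    )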

How did you plot the CPU usage? Is it the peak usage of a single core or some kind of mean value over all cores? A usage of 6% also seems quite low.

I used dstat to collect the CPU usage. The values presented in the plot correspond to the system CPU + user CPU. The experiments were performed on a machine with a 20-core CPU running at 2.40 GHz, which I think is why the usage is so low.

Thanks for the information. I just tested a quick ResNet training on ImageNet using 5 workers vs. 0 workers on an AMD EPYC 7742 64-core processor and see that the run with 5 workers uses:

--total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw
 12   1  86   1   0|   0    32k|  31M 2301k|   0     0 |  81k   86k
[...]

while no additional workers (loading in the main process) uses:

 3   0  97   0   0|   0    56k| 259k  303k|   0     0 |  13k   14k
[...]

Ok, so the problem must be in my setup. Did you run with pin_memory=True? Additionally, I was using torch.nn.DataParallel to train with 4 GPUs, and the dataset was stored in torch.Tensors with the images already pre-processed, to avoid having to do the pre-processing during training (my setup looked roughly like the sketch below). Do you think any of these factors may have caused the differences in CPU usage?
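
To be concrete, this is roughly what I had; the file names, tensor shapes, and batch size are illustrative, not the exact values I used:

import torch
from torch.utils.data import TensorDataset, DataLoader
import torchvision.models as models

# Images were resized/normalized offline and stored as tensors.
images = torch.load("imagenet_train_images.pt")   # hypothetical file name
labels = torch.load("imagenet_train_labels.pt")   # hypothetical file name
dataset = TensorDataset(images, labels)

loader = DataLoader(dataset, batch_size=256, shuffle=True,
                    num_workers=0, pin_memory=True)

model = models.alexnet(num_classes=1000)
model = torch.nn.DataParallel(model, device_ids=[0, 1, 2, 3]).cuda()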

Thank you very much for your help!

Yes, I used pin_memory=True and all 8x GA100s on this node. Also I loaded and augmented the ImageNet data on-the-fly.
The difference would be that I’ve used DistributedDataParallel with a single process per GPU, while you are using nn.DataParallel, which might cause a different CPU usage (unsure, as I haven’t verified it).
Could you try to use DDP and check if the usage is still the same (or remove DataParallel)?
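
For reference, a minimal single-node DDP sketch could look like the following (launched via torchrun; the dataset, model, and batch size are placeholders, not taken from your setup):

import os
import torch
import torch.distributed as dist
import torchvision.models as models
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def main():
    # torchrun sets LOCAL_RANK, RANK, and WORLD_SIZE for each process.
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    model = models.alexnet(num_classes=1000).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # One DataLoader per process; DistributedSampler shards the dataset.
    sampler = DistributedSampler(train_dataset)   # train_dataset is a placeholder
    loader = DataLoader(train_dataset, batch_size=64, sampler=sampler,
                        num_workers=0, pin_memory=True)
    # ... training loop ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()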

Thanks for the information you provided. Unfortunately, I no longer have access to the infrastructure where I ran the experiments, so I cannot see what happens with DDP. However, do you have an idea why nn.DataParallel may be causing the differences in CPU usage?

No, I don’t have a good idea what might be causing it. However, based on the initial description of the issue, it would be interesting to see the CPU workload for the DataLoader only in your setup to check if we see any unexpected behavior there.
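
E.g. something like this quick check, which iterates the DataLoader alone (no model, no DataParallel) and reports the CPU usage of the main process via psutil; psutil and the fixed number of batches are just one way to do it, not a requirement:

import time
import psutil
from torch.utils.data import DataLoader

proc = psutil.Process()

# dataset is a placeholder for your pre-processed TensorDataset.
loader = DataLoader(dataset, batch_size=256, shuffle=True,
                    num_workers=0, pin_memory=True)  # repeat with num_workers=2, 4, 8

num_batches = 100
proc.cpu_percent(interval=None)        # prime the counter
start = time.perf_counter()
it = iter(loader)
for _ in range(num_batches):
    data, target = next(it)            # fetch batches only, no forward/backward
elapsed = time.perf_counter() - start
print(f"{num_batches} batches in {elapsed:.1f}s, "
      f"main-process CPU: {proc.cpu_percent(interval=None):.1f}%")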