Is DataLoader num_workers related to GPU memory?

May I ask how to measure the batch loading speed and the model training iteration speed separately?
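
To make the question concrete, here is a minimal sketch of one way the two timings might be separated. It is not tied to any particular model: `time_loading_and_step`, `step_fn`, and `max_batches` are hypothetical names chosen for illustration, and `loader` can be any iterable of batches, such as a `torch.utils.data.DataLoader`.

```python
import time

def time_loading_and_step(loader, step_fn, max_batches=50):
    """Return (avg seconds waiting for a batch, avg seconds spent in step_fn).

    loader  : any iterable of batches (e.g. a torch DataLoader) -- assumption
    step_fn : hypothetical callable that runs one training iteration on a batch
    """
    load_total = 0.0
    step_total = 0.0
    n = 0
    it = iter(loader)
    while n < max_batches:
        t0 = time.perf_counter()
        try:
            batch = next(it)          # time spent waiting for the next batch
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)                # time spent in the training step
        t2 = time.perf_counter()
        load_total += t1 - t0
        step_total += t2 - t1
        n += 1
    if n == 0:
        raise ValueError("loader produced no batches")
    return load_total / n, step_total / n

# Toy usage with plain lists standing in for real batches.
load_avg, step_avg = time_loading_and_step([[1, 2], [3, 4]], lambda b: sum(b))
print(f"load: {load_avg:.6f}s/batch, step: {step_avg:.6f}s/batch")
```

When the step runs on a GPU, CUDA kernels launch asynchronously, so `step_fn` would probably need to call `torch.cuda.synchronize()` before returning for the step timing to be meaningful. If the loading average is much smaller than the step average, the data pipeline is likely not the bottleneck.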

Also, if I set num_workers = multiprocessing.cpu_count() to use every CPU core and the runtime still does not improve, does that mean there is no way to improve it?
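
A rough sketch of how different `num_workers` values might be compared, rather than jumping straight to `cpu_count()`. The names `sweep_num_workers` and `make_loader` are hypothetical; `make_loader(w)` is assumed to build a fresh loader with `w` workers, for example `lambda w: DataLoader(dataset, batch_size=64, num_workers=w)` with your own `dataset`.

```python
import time

def sweep_num_workers(make_loader, worker_counts, max_batches=50):
    """Return {num_workers: avg seconds per batch} for each candidate setting.

    make_loader : hypothetical factory, make_loader(w) -> iterable of batches
    """
    results = {}
    for w in worker_counts:
        loader = make_loader(w)
        start = time.perf_counter()
        n = 0
        for _ in loader:              # only iterate; no training step here
            n += 1
            if n >= max_batches:
                break
        elapsed = time.perf_counter() - start
        results[w] = elapsed / max(n, 1)
    return results

# Toy usage: range(10) stands in for a real DataLoader.
print(sweep_num_workers(lambda w: range(10), [0, 2, 4], max_batches=5))
```

The measured time includes worker startup, so a small `max_batches` can make higher worker counts look worse than they are. If the per-batch time stops dropping as workers are added, loading is probably no longer the limiting factor and the remaining time is likely spent elsewhere.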

I'm new to PyTorch, so sorry for these silly questions lol