My project runs fast on my workstation, at around 100% GPU utilization on an RTX 3090, but very slowly on a server machine with an H100 and many CPU cores.
The code simulates its data, so I don’t think the slowdown is related to reading from or writing to the SSD. I noticed that no matter how many workers I set on the cluster, 2 threads are at 100% utilization and all workers are almost idle. If I set 64 workers, the GPU waits for the CPU, goes through 64 batches at ~100% utilization, and then waits again. I chose this high number to make the pattern easier to see.

I wonder what this could be. Could it be related to some OMP stuff? I tried setting OMP_NUM_THREADS=1 without luck. pin_memory defaults to False.

Please find a screenshot of the utilization on the cluster attached.
The environment is

“Fixed” by the pull request “Fix processor affinity for fork child” by jjyyxx (Pull Request #1389, pytorch/builder, GitHub).
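Since the pull request above is about processor affinity in forked child processes, one way to check whether the same thing is happening on your machine is to look at the CPU affinity mask that the workers end up with. The following is only a minimal sketch, assuming Linux (`os.sched_getaffinity` and `os.sched_setaffinity` are not available on every platform) and assuming the slowdown really comes from workers being restricted to a couple of cores. The helper names `allowed_cpus` and `reset_affinity` are made up for illustration and are not part of PyTorch.

```python
import os


def allowed_cpus():
    """Return the set of CPU ids the current process may run on (Linux only)."""
    return os.sched_getaffinity(0)


def reset_affinity(worker_id=0):
    """Hypothetical worker_init_fn: try to widen this process's affinity mask.

    The worker_id argument is unused; it is only there so the function has
    the signature that a DataLoader worker_init_fn is called with.
    """
    all_cpus = range(os.cpu_count() or 1)
    try:
        # The kernel limits the request to the CPUs this process is
        # actually permitted to use (for example by a cgroup cpuset).
        os.sched_setaffinity(0, all_cpus)
    except OSError:
        # If the request is rejected, keep the current mask unchanged.
        pass
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    print("CPUs visible to the OS:", os.cpu_count())
    print("CPUs allowed before reset:", sorted(allowed_cpus()))
    print("CPUs allowed after reset:", sorted(reset_affinity()))
```

If the "allowed" set printed in the main process or inside a worker is much smaller than the number of cores on the server, that would be consistent with all workers competing for the same few cores. As a possible workaround on a build without the fix, `reset_affinity` could be passed as `worker_init_fn=reset_affinity` when constructing the `DataLoader`, so that each worker widens its own mask on startup.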

