For distributed workloads that do not use the torch.distributed.launch API, we can manually spawn the Python processes and set CPU/GPU affinity with "numactl" to get better performance. Examples include the NVIDIA MLPerf SSD run script with bind_launch.py and the PyTorch tuning guide (CPU specific optimizations - Utilize Non-Uniform Memory Access (NUMA) Controls). However, how can we enable NUMA binding for workloads that do use the torch.distributed.launch API?
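To illustrate the manual-spawn workaround mentioned above, here is a minimal sketch of a launcher that prefixes each rank's process with numactl, in the spirit of bind_launch.py. It assumes a single node with 8 GPUs split evenly across 2 NUMA nodes, a hypothetical training script train.py that reads LOCAL_RANK from the environment, and init_method="env://" inside the script; adjust the topology constants to your machine (e.g. check `nvidia-smi topo -m` and `numactl --hardware`).

```python
import os
import subprocess
import sys

NGPUS = 8                # assumption: 8 GPUs on one node
GPUS_PER_NUMA_NODE = 4   # assumption: GPUs 0-3 on NUMA node 0, GPUs 4-7 on node 1

procs = []
for local_rank in range(NGPUS):
    numa_node = local_rank // GPUS_PER_NUMA_NODE

    # Environment variables consumed by init_process_group(init_method="env://").
    env = os.environ.copy()
    env.update({
        "MASTER_ADDR": "127.0.0.1",
        "MASTER_PORT": "29500",
        "WORLD_SIZE": str(NGPUS),
        "RANK": str(local_rank),
        "LOCAL_RANK": str(local_rank),
    })

    # Bind both CPU execution and memory allocation to the GPU's NUMA node.
    cmd = [
        "numactl",
        f"--cpunodebind={numa_node}",
        f"--membind={numa_node}",
        sys.executable, "train.py",   # hypothetical training script
    ]
    procs.append(subprocess.Popen(cmd, env=env))

for p in procs:
    p.wait()
```

The question remains how to get the equivalent per-rank binding when the processes are created by torch.distributed.launch itself rather than by a custom launcher like this.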
I encountered the same problem when training an 18.4B-parameter GPT model. Is there any standard solution?
@hubertlu-tw @kwen2501