For distributed workloads that do not use the torch.distributed.launch API, we can manually spawn the Python processes and set CPU/GPU affinity with "numactl" to get better performance. Examples include the NVIDIA MLPerf SSD run script with bind_launch.py and the PyTorch tuning guide (CPU specific optimizations - Utilize Non-Uniform Memory Access (NUMA) Controls). However, how can we enable NUMA binding for workloads that do use the torch.distributed.launch API?
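To illustrate the manual-spawn workaround mentioned above, here is a minimal sketch of a launcher that prefixes each rank's process with numactl, in the spirit of bind_launch.py. It assumes a single node with 8 GPUs split evenly across 2 NUMA nodes, a hypothetical training script train.py that reads LOCAL_RANK from the environment, and init_method="env://" inside the script; adjust the topology constants to your machine (e.g. check `nvidia-smi topo -m` and `numactl --hardware`).

```python
import os
import subprocess
import sys

NGPUS = 8                # assumption: 8 GPUs on one node
GPUS_PER_NUMA_NODE = 4   # assumption: GPUs 0-3 on NUMA node 0, GPUs 4-7 on node 1

procs = []
for local_rank in range(NGPUS):
    numa_node = local_rank // GPUS_PER_NUMA_NODE

    # Environment variables consumed by init_process_group(init_method="env://").
    env = os.environ.copy()
    env.update({
        "MASTER_ADDR": "127.0.0.1",
        "MASTER_PORT": "29500",
        "WORLD_SIZE": str(NGPUS),
        "RANK": str(local_rank),
        "LOCAL_RANK": str(local_rank),
    })

    # Bind both CPU execution and memory allocation to the GPU's NUMA node.
    cmd = [
        "numactl",
        f"--cpunodebind={numa_node}",
        f"--membind={numa_node}",
        sys.executable, "train.py",   # hypothetical training script
    ]
    procs.append(subprocess.Popen(cmd, env=env))

for p in procs:
    p.wait()
```

The question remains how to get the equivalent per-rank binding when the processes are created by torch.distributed.launch itself rather than by a custom launcher like this.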
I encountered the same problem when training an 18.4B-parameter GPT model. Is there any standard solution?
@hubertlu-tw @kwen2501