Enable NUMA binding with torch.distributed.launch

For distributed workloads that do not use the torch.distributed.launch API, we can manually spawn Python processes and set CPU/GPU affinity with numactl to get better performance. Examples include the NVIDIA MLPerf SSD run script with bind_launch.py, and the PyTorch tuning guide section "CPU specific optimizations - Utilize Non-Uniform Memory Access (NUMA) Controls". However, how can we enable NUMA binding for workloads that are launched through the torch.distributed.launch API?
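For context, here is a minimal sketch of the manual approach, assuming a single machine, an even split of local ranks across NUMA nodes, and a training script that initializes with `init_method="env://"`. The helper names (`numa_node_for_rank`, `build_command`, `launch`) and the default values are hypothetical and only for illustration; this is not the actual bind_launch.py.

```python
import os
import subprocess
import sys


def numa_node_for_rank(local_rank, nproc_per_node, num_numa_nodes):
    """Spread local ranks evenly over NUMA nodes (assumed mapping)."""
    return local_rank * num_numa_nodes // nproc_per_node


def build_command(local_rank, nproc_per_node, num_numa_nodes, script, script_args=()):
    """Wrap the training script in numactl so CPU and memory stay on one node."""
    node = numa_node_for_rank(local_rank, nproc_per_node, num_numa_nodes)
    return [
        "numactl",
        f"--cpunodebind={node}",  # run only on the CPUs of this NUMA node
        f"--membind={node}",      # allocate memory only from this NUMA node
        sys.executable, "-u", script, *script_args,
    ]


def launch(script, script_args=(), nproc_per_node=2, num_numa_nodes=2,
           master_addr="127.0.0.1", master_port=29500):
    """Spawn one bound process per local rank (single-node assumption)."""
    procs = []
    for local_rank in range(nproc_per_node):
        # Environment variables read by init_process_group(init_method="env://").
        env = dict(
            os.environ,
            MASTER_ADDR=master_addr,
            MASTER_PORT=str(master_port),
            WORLD_SIZE=str(nproc_per_node),
            RANK=str(local_rank),  # single node, so global rank == local rank
            LOCAL_RANK=str(local_rank),
        )
        cmd = build_command(local_rank, nproc_per_node, num_numa_nodes,
                            script, script_args)
        procs.append(subprocess.Popen(cmd, env=env))
    # Wait for all workers and return their exit codes.
    return [p.wait() for p in procs]
```

Calling something like `launch("train.py", ["--epochs", "1"], nproc_per_node=8, num_numa_nodes=2)` would put local ranks 0-3 on node 0 and ranks 4-7 on node 1. It requires numactl to be installed, and `train.py` is a placeholder.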

@d4l3k Wondering if you have any idea? Thanks!