Hogwild SGD training on CPU stalls as the number of processes increases

I am training on the MNIST dataset with Hogwild asynchronous SGD, following your example. My machine has a 64-core CPU and one A100 GPU.
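For context, what I am running follows the example's Hogwild pattern roughly like this; this is a minimal sketch with my own placeholder names (`Net`, `train`, and the argument values), not the exact example code:

```python
# Minimal sketch of the Hogwild setup I am running (placeholder names,
# simplified from the example).
import torch
import torch.multiprocessing as mp
import torch.nn as nn

class Net(nn.Module):
    # Placeholder model; the real one is the example's small MNIST CNN.
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(784, 10)

    def forward(self, x):
        return self.fc(x.view(x.size(0), -1))

def train(rank, model, num_epochs):
    # Each worker builds its own optimizer and updates the shared
    # parameters asynchronously (Hogwild).
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    # ... training loop over MNIST batches goes here ...

if __name__ == "__main__":
    num_processes = 6            # the value I vary between runs
    model = Net()
    model.share_memory()         # parameters shared across processes

    processes = []
    for rank in range(num_processes):
        p = mp.Process(target=train, args=(rank, model, 1))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
```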

When running on CPU, the more processes I use, the longer the run takes. I have tried 3, 4, 5, 6, and more, and eventually the run stalls. As the figure shows, with 6 processes the processing time effectively becomes infinite (represented here by 999). I can confirm that the number of processes matches the number of cores actually used: when I monitor CPU usage, the number of busy cores equals the number of processes specified for the run. I expected that using more processes would make training faster, but that is not the case. At first I thought the increased communication might be the culprit, but I am not sure. Do you know why this happens? Is this a bug, or am I missing something?
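For reference, this is roughly how I time the runs for each process count. The script name `main.py` and the `--num-processes` flag are assumptions based on the example; runs that exceed the timeout are the ones I record as 999 in the figure:

```python
# Rough sketch of my timing sweep over process counts (script name and
# flag are assumptions; adjust to the actual entry point).
import subprocess
import time

TIMEOUT_S = 3600  # runs longer than this are treated as stalled ("999")

for n in (3, 4, 5, 6):
    start = time.time()
    try:
        subprocess.run(
            ["python", "main.py", "--num-processes", str(n)],
            check=True,
            timeout=TIMEOUT_S,
        )
        elapsed = time.time() - start
    except subprocess.TimeoutExpired:
        elapsed = 999  # placeholder for a stalled run
    print(f"{n} processes: {elapsed:.1f} s")
```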

Also, when running the program with --cuda, what is the significance of the number of processes if there is only one GPU available? What I observed is that, although the run does not stall, the processing time still grows as the number of processes increases.
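In case it clarifies the question, this is my understanding of how the --cuda run is structured: every worker process runs on a CPU core but sends its batches to the same single GPU. A sketch under that assumption, again with placeholder names rather than the example's exact code:

```python
# Sketch of my understanding of the --cuda path (placeholder names;
# the spawn start method is needed to use CUDA in child processes).
import torch
import torch.multiprocessing as mp

def train(rank, model, device):
    # Each worker runs on a CPU core but pushes its batches to the same
    # GPU, so all workers share the single A100.
    for _ in range(10):
        x = torch.randn(64, 784, device=device)  # stand-in for an MNIST batch
        loss = model(x).sum()
        loss.backward()

if __name__ == "__main__":
    mp.set_start_method("spawn")  # required for CUDA in subprocesses
    device = torch.device("cuda")
    model = torch.nn.Linear(784, 10).to(device)
    model.share_memory()

    num_processes = 4
    processes = [mp.Process(target=train, args=(rank, model, device))
                 for rank in range(num_processes)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
```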

Thanks for your help.

Can someone from PyTorch reply to this post, please?