nn.DataParallel fails with AMD CPU + 2 1080Ti

I have two computers, both with PyTorch installed via Conda:

  1. One with an Intel CPU + 2 Titan X GPUs (call this the Intel computer).
  2. One with an AMD Threadripper + 2 1080 Ti GPUs (call this the AMD computer).

I tested the same code, with the same model, on the same dataset, but the results from the two computers are different.
Say I run with a batch of 512: only the first half (256 results) matches between the two computers, while the latter half differs. Here is a screenshot of the output for the first (idx=0) and last (idx=-1) samples in a batch:

[screenshot of the batch outputs]

The results on the left are from the Intel computer (and are correct); the AMD computer's results are on the right.

Does anyone have the same problem and know how to solve it?
Thanks in advance.

Hi,

Does that prevent the model from converging?
I would not expect training to give identical results on such different hardware.
Keep in mind that adding two floating-point numbers can give different answers depending on the hardware and the order of operations (and both will be correct with respect to the IEEE standard).
Nothing makes the answer on one machine more correct than the other; both are correct.
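For a concrete sense of why, here is a tiny self-contained sketch (plain Python, no PyTorch needed) showing that the order of floating-point additions changes the result. Parallel reductions on different hardware accumulate in different orders, which is where such small discrepancies come from:

```python
# Floating-point addition is not associative: summing the same numbers
# in a different order can produce a different result.
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0 -- the 1.0 is absorbed next to the huge magnitude
```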

Now, if that prevents the model from converging, I would first try different random seeds on the original machine to make sure that the model is actually stable and converges every time, not just for that one random seed.
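A minimal sketch of such a seed sweep (here `build_model`, `train_loop`, and `evaluate` are hypothetical placeholders for your own code; only the seeding calls are real PyTorch API):

```python
import torch

for seed in [0, 1, 42]:
    torch.manual_seed(seed)            # seed the CPU RNG
    torch.cuda.manual_seed_all(seed)   # seed the RNGs of all CUDA devices
    model = build_model().cuda()       # placeholder: your model construction
    train_loop(model)                  # placeholder: your training loop
    print(seed, evaluate(model))       # results should be broadly similar across seeds
```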

Hi,

  1. The results above are from the testing step, where the model weights are already trained. Even so, only the first half of every batch yields identical results on both the AMD and Intel computers, while the second half is very different. Since nn.DataParallel splits a batch of 512 into two chunks of 256 across the two GPUs, the mismatching half appears to be the one processed on the second GPU (see the sketch below).
  2. When training on the AMD machine with the parallel option, it never converges, no matter what learning rate I set.
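
One way to narrow this down is to bypass DataParallel and run the same batch through the model on each GPU separately, then compare the outputs. A minimal sketch, assuming `model` is the trained model and `batch` is one test batch (both placeholders for your own objects):

```python
import torch

model.eval()
with torch.no_grad():
    out0 = model.to('cuda:0')(batch.to('cuda:0')).cpu()  # run entirely on GPU 0
    out1 = model.to('cuda:1')(batch.to('cuda:1')).cpu()  # then entirely on GPU 1

print(torch.allclose(out0, out1, atol=1e-5))  # tiny float noise is expected
print((out0 - out1).abs().max())              # a large gap points at a GPU/driver issue
```

If the two GPUs disagree far beyond float-level noise, that would suggest a hardware or driver problem on the second card rather than ordering differences.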

Thanks.