Hey all. I have some code which I’ve been training on my laptop, and the loss goes down steadily, indicating that the model is training. However, when I run the same code on a remote Ubuntu machine with a much better desktop GPU, the loss stalls out from the beginning. I expected roughly the same loss values/patterns on both machines, just with the desktop GPU being much faster. I’ve only trained for around 10 epochs so far, but that’s enough to see a major difference between the Windows and Ubuntu trainings.
My question is, are there significant differences when transferring from one OS to the other with PyTorch code that I should be aware of? Do some packages behave differently? I currently use pathlib to ensure my filepaths are platform-agnostic, but other than that, I don’t really do anything special.
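Not the OP, but for anyone curious what the pathlib approach looks like, here’s a minimal sketch (the `data` directory layout is just an illustration, not from the original post):

```python
from pathlib import Path

# Hypothetical project layout for illustration.
data_root = Path("data")

# The / operator joins path components with the correct separator
# for whatever OS the code runs on, so the same line works on both
# Windows and Linux.
train_images = data_root / "train" / "images"

# as_posix() always renders the path with forward slashes,
# regardless of platform.
print(train_images.as_posix())  # → data/train/images
```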
This is using the same pytorch version, right?
Did you try Windows GPU? How does that compare against Ubuntu GPU? What about Windows CPU vs Ubuntu CPU?
Also, things like cudnn and mkldnn matter, as they provide a lot of the conv algorithms used.
Yep, both machines are using PyTorch 1.1.0 with Python 3.7.3 through Anaconda. However, there is a difference in CUDA versions: my Windows laptop is on CUDA 10, whereas the Ubuntu desktop is on CUDA 9. Could that be a large enough difference to cause this training issue?
With regards to comparing GPUs, the Windows GPU trains (albeit a lot slower because it’s an older laptop GPU), but the Ubuntu GPU doesn’t seem to train at all. I haven’t tried comparing CPU performance yet.
@SimonW Just to do some comparison, I tried running the transfer learning tutorial code on both my laptop and the Ubuntu desktop, and it trains fine on both, indicating that the problem probably lies somewhere specific to my code and not with either machine.
Found my issue! My code generates a list of file paths for the raw image files as well as the masks, and I forgot that file ordering can differ between Windows and Linux when using something like os.listdir(). On Windows, the file orderings for the raw images and the masks happened to match, so each image was paired with the correct mask for processing. On Linux, however, the raw images came back in a different order than the masks, so images were being paired with the wrong masks, which is why the model never trained.
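For anyone hitting the same thing: os.listdir() makes no ordering guarantee at all (the order depends on the filesystem), so the usual fix is to sort both listings before zipping them together. A minimal sketch, with a hypothetical `paired_files` helper and hypothetical directory arguments:

```python
import os

def paired_files(image_dir, mask_dir):
    """Return (image, mask) filename pairs in a deterministic order.

    os.listdir() returns entries in arbitrary, filesystem-dependent
    order, so sorting both listings guarantees the same pairing on
    every OS. This assumes images and masks share filenames (or at
    least sort into the same relative order).
    """
    images = sorted(os.listdir(image_dir))
    masks = sorted(os.listdir(mask_dir))
    assert len(images) == len(masks), "image/mask counts differ"
    return list(zip(images, masks))
```

If the filenames don’t match one-to-one, an even safer variant is to build the mask path from each image’s filename directly instead of relying on two parallel listings.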
Thank you for this, had the same problem!
It took me three days and I was unable to understand the issue I was having, until I finally landed on this page and got the solution. Bundle of thanks.