Hey all. I have some code which I’ve been training on my laptop, and the loss goes down steadily, indicating that the model is training. However, when I run the same code on a remote Ubuntu machine with a much better desktop GPU, the loss stalls out from the beginning. I expected roughly the same loss values/patterns on both machines, just with the desktop GPU being much faster. I’ve only trained for around 10 epochs so far, but that’s enough to see a major difference between the Windows and Ubuntu trainings.
My question is, are there significant differences when transferring from one OS to the other with PyTorch code that I should be aware of? Do some packages behave differently? I currently use pathlib to ensure my filepaths are platform-agnostic, but other than that, I don’t really do anything special.
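Not the OP, but for anyone curious what the pathlib approach looks like, here’s a minimal sketch (the `data` directory layout is just an illustration, not from the original post):

```python
from pathlib import Path

# Hypothetical project layout for illustration.
data_root = Path("data")

# The / operator joins path components with the correct separator
# for whatever OS the code runs on, so the same line works on both
# Windows and Linux.
train_images = data_root / "train" / "images"

# as_posix() always renders the path with forward slashes,
# regardless of platform.
print(train_images.as_posix())  # → data/train/images
```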
This is using the same pytorch version, right?
Did you try Windows GPU? How does that compare against Ubuntu GPU? What about Windows CPU vs Ubuntu CPU?
Also, things like cudnn and mkldnn matter, as they provide a lot of the conv algorithms used.
Yep, both machines are using PyTorch 1.1.0 with Python 3.7.3 through Anaconda. However, there is a difference in CUDA versions: my Windows laptop is on CUDA 10, whereas the Ubuntu desktop is on CUDA 9. Could that be a large enough difference to cause this training issue?
With regards to comparing GPUs, the Windows GPU trains (albeit a lot slower because it’s an older laptop GPU), but the Ubuntu GPU doesn’t seem to train at all. I haven’t tried comparing CPU performance yet.
@SimonW Just to do some comparison, I tried running the transfer learning tutorial code on both my laptop and the Ubuntu desktop, and it trains fine on both, indicating that the problem probably lies somewhere specific to my code and not with either machine.
Found my issue! My code generates a list of file paths for the raw image files as well as the masks, and I forgot that file ordering can differ between Windows and Linux when using something like os.listdir(). On Windows, the file orderings for the raw images and the masks happened to match, so each image was paired with the correct mask for processing. On Linux, however, the raw images came back in a different order than the masks, so images were being paired with the wrong masks, which is why the model never trained.
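For anyone hitting the same thing: os.listdir() makes no ordering guarantee at all (the order depends on the filesystem), so the usual fix is to sort both listings before zipping them together. A minimal sketch, with a hypothetical `paired_files` helper and hypothetical directory arguments:

```python
import os

def paired_files(image_dir, mask_dir):
    """Return (image, mask) filename pairs in a deterministic order.

    os.listdir() returns entries in arbitrary, filesystem-dependent
    order, so sorting both listings guarantees the same pairing on
    every OS. This assumes images and masks share filenames (or at
    least sort into the same relative order).
    """
    images = sorted(os.listdir(image_dir))
    masks = sorted(os.listdir(mask_dir))
    assert len(images) == len(masks), "image/mask counts differ"
    return list(zip(images, masks))
```

If the filenames don’t match one-to-one, an even safer variant is to build the mask path from each image’s filename directly instead of relying on two parallel listings.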
Thank you for this, had the same problem!
It took me three days and I was unable to understand the issue I was having, until I finally landed on this page and got the solution. Bundle of thanks.