Drastically different inference results on different machines?

I trained a segmentation model in PyTorch; on my local computer (Windows, conda, CUDA 10.2, PyTorch 1.2) it reaches an F1 score of 0.93 on the test set. However, the F1 score dropped to 0.3 when testing on a Linux server (conda, CUDA 9.0, PyTorch 1.1). I double-checked that the code, label files, and test sets are identical on both machines, and there is no "explicit" random sampling in my code (and even if there were, the effect shouldn't be this drastic).
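One quick way to rule out silently different copies of the code, labels, or checkpoint is to compare file hashes on both machines. A minimal sketch (the file names in the commented loop are placeholders):

```python
import hashlib


def sha256sum(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


# Run on both machines and compare the printed digests:
# for f in ["checkpoint.pth", "labels.csv", "test_list.txt"]:
#     print(f, sha256sum(f))
```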

I tried the following to solve my problem:
I suspected it might have something to do with PyTorch versions, so I installed PyTorch 1.1 on my Windows machine to match the server's, but got the same 0.93 score. I then thought it might be CUDA itself, so I eliminated the GPU factor altogether and ran inference on the CPU on the server; I still got 0.3 there.

What could possibly be causing this huge discrepancy?

Are you setting the model to evaluation mode with model.eval()?
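For reference, a minimal sketch of why this matters (the toy model here is just an illustration): in train mode, layers such as dropout and batchnorm behave stochastically or use batch statistics, so repeated inference on the same input can differ.

```python
import torch
import torch.nn as nn

# Hypothetical toy model, just to illustrate the train/eval switch.
model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))

x = torch.ones(1, 4)

model.train()            # dropout active: outputs vary between calls
out_train = model(x)

model.eval()             # dropout disabled, batchnorm would use running stats
with torch.no_grad():    # also skip autograd bookkeeping during inference
    out_eval_1 = model(x)
    out_eval_2 = model(x)

# In eval mode the output is deterministic for a fixed input.
assert torch.equal(out_eval_1, out_eval_2)
```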

Yes, and the code is identical on both machines.

Could you check the output for a single input and post the max absolute difference between the predictions?
If this difference is already high, the next step would be to bisect the model and check each layer's output.
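A sketch of the comparison itself: save the raw model output for one fixed input on machine A (e.g. with `torch.save`), load it on machine B, and compute the max absolute difference. Here `out_a` and `out_b` are toy stand-ins for those two outputs:

```python
import torch

# Stand-ins for the raw model outputs captured on each machine,
# e.g. out_a = torch.load("out_machine_a.pt") on machine B.
out_a = torch.tensor([[0.1, 2.0, -1.5]])
out_b = torch.tensor([[0.1, 2.1, -1.5]])

max_abs_diff = (out_a - out_b).abs().max().item()
print(f"max abs difference: {max_abs_diff:.6f}")
```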

Do you mean the pixelwise max absolute difference of a predicted mask? I'm doing binary segmentation and treat a prediction as positive if len(mask_pred[mask_pred > 0.5]) > 0.

Yes, I meant the output of the model.
Your code would yield a positive prediction if at least one pixel belongs to class 1, is that correct?
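That rule can be written more directly with `.any()`, which is equivalent to the `len(...) > 0` check above; a small sketch with a toy mask:

```python
import torch

# Toy 2x2 "mask" of sigmoid outputs; one pixel (0.7) crosses 0.5.
mask_pred = torch.tensor([[0.1, 0.2],
                          [0.7, 0.3]])

# Equivalent to len(mask_pred[mask_pred > 0.5]) > 0, without
# materializing the intermediate index tensor.
is_positive = bool((mask_pred > 0.5).any())
print(is_positive)
```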

I checked both the maximum (across all samples) difference before and after sigmoid on the 2 machines and they were the same!

Before: 37.93775939941406
After: 0.9999369382858276

The maximums and ranges for individual samples do differ, though!

Do the provided values represent the difference between both models?
I'm not sure I understand the last post correctly.
What was the same, and what differs now?

Oh, I meant that 37.9 is the max difference on both machines before the sigmoid, and 0.99 is the max difference on both machines after the sigmoid. Although the overall max difference is the same, I noticed that the differences for individual samples vary across machines.

Ah OK, thanks for the clarification.
In that case, I would recommend checking each layer's output and narrowing down where this difference is created.
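One way to capture every layer's output without editing the model is forward hooks. A minimal sketch (the toy model and shapes are placeholders for your segmentation net): record each layer's activations into a dict, save it with `torch.save` on both machines, then diff the tensors layer by layer.

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the real segmentation network.
model = nn.Sequential(
    nn.Conv2d(1, 2, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(2, 1, 3, padding=1),
)

activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Detach and move to CPU so the dict can be torch.save'd.
        activations[name] = output.detach().cpu()
    return hook

for name, module in model.named_modules():
    if name:  # skip the top-level container itself
        module.register_forward_hook(make_hook(name))

torch.manual_seed(0)            # fixed input so both machines see the same data
x = torch.randn(1, 1, 8, 8)
_ = model(x)

# One entry per layer; torch.save(activations, ...) and compare across machines.
print(sorted(activations.keys()))
```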

Thanks, I'll do that and post an update! Do you think it's a PyTorch version issue? Maybe the implementation of some operations changed between 1.1.0 and 1.2.0?

You mentioned you've installed the same PyTorch versions and are using the CPU now?
Make sure to use the same versions before starting to debug.

Also, I would recommend using the latest stable version (1.4) as well as Python 3, as we've had a weird issue with Python 2.7 recently (link).

Yes, I forgot about that; I did test different versions. It's just that the Linux server uses CUDA 9.0, and 1.1.0 is the highest version supported for it, if I'm not mistaken. I'm using Python 3.7 on both (3.7.3 on Linux, 3.7.6 on Windows).

If you are using the binaries, they ship with their own CUDA and cudnn libraries.
You only have to provide an NVIDIA driver.
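A quick sanity check of what each install actually ships (run on both machines and compare), using attributes PyTorch exposes for exactly this purpose:

```python
import torch

# The binaries bundle their own CUDA/cudnn, independent of any
# system-wide CUDA toolkit, so inspect what this install reports.
print(torch.__version__)                # e.g. '1.1.0'
print(torch.version.cuda)               # CUDA version the binary was built with
print(torch.backends.cudnn.version())   # bundled cudnn version (None on CPU-only)
print(torch.cuda.is_available())        # is a usable NVIDIA driver visible?
```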