I trained a segmentation model in PyTorch, and it achieved an F1 score of 0.93 when tested on my local machine (Windows, conda, CUDA 10.2, PyTorch 1.2). However, the F1 score dropped to 0.3 when testing on a Linux server (conda, CUDA 9.0, PyTorch 1.1). I double-checked that the code, label files, and test sets were identical on both machines, and there is no “explicit” random sampling in my code (and even if there were, the effect shouldn’t be this drastic).
I tried the following to solve my problem:
I suspected that it might have something to do with PyTorch versions, so I installed PyTorch 1.1 on my Windows machine to match the server’s, but got the same 0.93 score. I then thought it had to do with CUDA itself, so I eliminated the GPU factor altogether and ran inference on the CPU on the server; I still got 0.3.
What could possibly be causing this huge discrepancy?
Could you check the output of a single input and post the max absolute difference between the predictions?
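A comparison like that could be sketched as follows. This is a hypothetical workflow, not your actual code: the model, checkpoint, and file names are placeholders, and the tensors at the bottom stand in for the two saved outputs.

```python
import torch

# On EACH machine (model loading omitted; paths are placeholders):
# model.eval()
# x = torch.load("fixed_input.pt")           # the same saved input on both machines
# with torch.no_grad():
#     out = model(x)
# torch.save(out.cpu(), "out_windows.pt")    # or "out_linux.pt" on the server

# Then, on ONE machine, load both files and compare.
# Here random tensors stand in for the two saved outputs:
torch.manual_seed(0)
out_windows = torch.randn(1, 1, 64, 64)
out_linux = out_windows + 0.01 * torch.randn_like(out_windows)

max_abs_diff = (out_windows - out_linux).abs().max().item()
print(f"max abs difference: {max_abs_diff:.6f}")
```

Differences around float precision (~1e-6) are expected across hardware; anything large points at a real divergence in the pipeline.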
If this difference is already high, the next step would be to bisect the model and check each layer’s output.
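The bisection could use forward hooks to record every layer’s output, which can then be saved on each machine and diffed pairwise to find the first layer that diverges. A minimal sketch, with a toy Sequential model standing in for the real segmentation network:

```python
import torch
import torch.nn as nn

# Toy stand-in for the real segmentation model
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 1, 3, padding=1),
)
model.eval()

activations = {}

def save_activation(name):
    # Returns a hook that stores the layer's output under its module name
    def hook(module, inp, out):
        activations[name] = out.detach().cpu()
    return hook

for name, module in model.named_modules():
    if name:  # skip the root module itself
        module.register_forward_hook(save_activation(name))

torch.manual_seed(0)
x = torch.randn(1, 3, 32, 32)  # use the same saved input on both machines
with torch.no_grad():
    model(x)

# Save per-layer activations on each machine, e.g.
# torch.save(activations, "activations_windows.pt"),
# then load both files on one machine and find the first layer
# whose outputs differ significantly.
for name, act in activations.items():
    print(name, tuple(act.shape))
```

The first layer with a large discrepancy narrows the problem down to a specific op (or its inputs).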
Do you mean the pixelwise max absolute difference of a predicted mask? I’m doing binary segmentation, and I treat a prediction as positive if len(mask_pred[mask_pred > 0.5]) > 0.
Do the provided values represent the difference between both models?
I’m not sure I understand the last post correctly.
What was the same and what differs now?
Oh, I meant that 37.9 is the max difference on both machines before the sigmoid, and 0.99 is the max difference on both machines after the sigmoid. Although the overall max difference is the same, I noticed that the differences for individual samples are not the same across machines.
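A per-sample comparison would make that concrete. In this sketch the two tensors are stand-ins for the raw logits saved on each machine (before the sigmoid); the reduction computes one max-abs-diff value per sample:

```python
import torch

torch.manual_seed(0)
# Stand-ins for the saved per-sample logits from the two machines
logits_windows = torch.randn(8, 1, 16, 16)
logits_linux = logits_windows + 0.05 * torch.randn_like(logits_windows)

# Max absolute difference per sample: flatten everything but the batch dim
per_sample = (logits_windows - logits_linux).abs().flatten(1).max(dim=1).values
for i, d in enumerate(per_sample.tolist()):
    print(f"sample {i}: max abs diff {d:.4f}")
print(f"overall max abs diff: {per_sample.max().item():.4f}")
```

This shows whether the divergence is concentrated in a few samples or spread uniformly, which are quite different failure modes.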
Thanks, I’ll do that and post an update! Do you think it’s a PyTorch version issue? Maybe the implementation of some operations changed between 1.1.0 and 1.2.0?
Yes, I forgot to mention that I did test different versions. It’s just that the Linux server uses CUDA 9.0, and 1.1.0 is the highest PyTorch version it supports, if I’m not mistaken. I’m using Python 3.7 on both machines (3.7.3 on Linux, 3.7.6 on Windows).