Getting different results on each run with libtorch 1.6

I traced a model and I am using it within libtorch, but for some bizarre reason the results are different on each run :frowning: I am totally baffled by this.

So, I load the model like so:

  torch::jit::script::Module module_af;
  module_af = torch::jit::load(af_model_path);
  module_af.eval();
  // std::cout << "Model Load ok\n";
  filelog.get()->info("Model Load ok");

and I run the inference like so:

    Mat img4encodingRGB = imread(allign_filename, cv::COLOR_BGR2RGB);
    auto img2encode = torch::from_blob(img4encodingRGB.data, {img4encodingRGB.rows, img4encodingRGB.cols, img4encodingRGB.channels()}, at::kByte);
  
    img2encode = img2encode.to(at::kFloat).div(255).unsqueeze(0);
    img2encode = img2encode.permute({ 0, 3, 1, 2 });
    img2encode.sub_(0.5).div_(0.5);
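
As an aside (and probably unrelated to the nondeterminism): cv::imread's second argument is an ImreadModes flag, not a colour-conversion code, so passing cv::COLOR_BGR2RGB there does not actually swap the channel order - the Mat is still BGR. If RGB order is intended, a minimal sketch of the usual explicit conversion (standard OpenCV API) would be:

    // Read the image (OpenCV loads it in BGR order) and convert to RGB
    // explicitly; cv::COLOR_BGR2RGB is only meaningful as a cvtColor code.
    Mat img4encodingBGR = imread(allign_filename, cv::IMREAD_COLOR);
    Mat img4encodingRGB;
    cv::cvtColor(img4encodingBGR, img4encodingRGB, cv::COLOR_BGR2RGB);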

I run the forward like so:

    std::vector<torch::jit::IValue> arcface_inputs;
    arcface_inputs.push_back(img2encode);
    at::Tensor embeds0 = module_af.forward(arcface_inputs).toTensor();

    std::cout << embeds0; // Gives different output on each run.

I am really baffled by this. The problem seems to be even worse - on two machines they produce identical results on consecutive runs, but on two other machines they don't. All packages are EXACTLY the same - libtorch 1.6 and the above code compiled with cmake.

It kind of reminds me of undefined behaviour, but I am totally lost, because on two machines (a server and a VM) they seem to produce identical results - but they don't on the other two.

I have triple-checked everything to see if I am doing something stupid, but it does not seem like it - hence this post.

Hope someone can point me to clues as to what I could be doing wrong :sob:

Example output, run 1:

Columns 1 to 10
-0.1005 -0.1768 -0.2082  0.1240  0.1185  0.3801  0.1378  0.1269 -0.3572 -1.1453

Run 2:

Columns 1 to 10
-0.1861 -0.3326 -0.3739  0.2302  0.1730  0.5391  0.1965  0.1972 -0.5481 -1.7317
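
One way I could try to narrow this down (a minimal sketch, reusing module_af and img2encode from above) is to call forward twice on the identical tensor within a single process and compare the results; if those already differ, the variation is not coming from anything that happens at start-up:

    // Two forward passes on the same input in the same process.
    at::Tensor a = module_af.forward({img2encode}).toTensor();
    at::Tensor b = module_af.forward({img2encode}).toTensor();
    std::cout << "identical within one process: "
              << torch::allclose(a, b) << "\n";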

Could you check if a static input (e.g. torch::ones) would also yield different results, please?


@ptrblck thank you for this suggestion (not sure why I did not think of it).

So, I ran with ones like so:

    std::vector<torch::jit::IValue> inputs;
    at::Tensor input = torch::ones({1, 3, 112, 112});
    inputs.push_back(input);
    at::Tensor embeds0 = module_af.forward(inputs).toTensor();
    // at::Tensor embeds0 = module_af.forward({img2encode}).toTensor();

    std::cout << embeds0;

The result is that on the (consumer-grade) laptops the results vary with every run - BUT when I run the EXACT same code on a server VM, I get repeatable results. This is so baffling. I am very much inclined to think this has something to do with the hardware.
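
One thing I plan to rule out (just my own guess, a minimal sketch) is intra-op threading: with multiple threads the summation order of reductions can vary between runs, which could give different floating-point results on some CPUs but not others. Pinning ATen to a single thread before the forward pass should test that:

    #include <ATen/Parallel.h>

    // Force single-threaded execution so the reduction order is fixed.
    at::set_num_threads(1);

    std::vector<torch::jit::IValue> inputs;
    inputs.push_back(torch::ones({1, 3, 112, 112}));
    at::Tensor embeds0 = module_af.forward(inputs).toTensor();
    std::cout << embeds0;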

I forgot to mention, I had compiled libtorch 1.6 from source for CPU (I had edited the makefile).
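
Since the machines behave differently, I will also dump the build/runtime configuration on each of them to confirm the builds really are identical - a small sketch, assuming at::show_config() is available in the C++ API (it is what torch.__config__.show() prints in Python):

    #include <ATen/Version.h>

    // Print compiler, BLAS / MKL-DNN backends, parallel backend, etc.
    std::cout << at::show_config() << std::endl;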

Thanks for the update!
Would it be possible to rebuild PyTorch from the current master branch in case you are hitting a known and already fixed issue?
If that also yields large numerical differences, you might indeed be hitting a (hardware-specific) bug, and I’m happy to help debug it.


@ptrblck Thank you VERY much for your suggestions. I am now rebuilding the master branch (11.0) - I will let you know how this goes. Another update: I traced another model (thinking it could be the model itself)… but this also behaves the same way on my consumer-grade hardware while returning consistent results on the server. How bizarre. Even more bizarre is that I have another model loaded within the same program, and that one always returns consistent results everywhere.

BTW: would it help if I sent you the model? Just wondering.

best

Yes, the model definition would be helpful to have so that I can export it and try to reproduce the issue.


@ptrblck thank you for helping me out on this (still totally baffled by it all and not able to make any progress :sob: )

The jit traced model is here:

https://drive.google.com/file/d/1AbUPU9awdcljYZigXJbYjoA8Hh2NiMfx/view?usp=sharing

The model accepts a (1, 3, 112, 112) input and produces a (1, 512) output.

The original model, from which the trace was produced using jit.trace, is here:

https://drive.google.com/file/d/1iz9rIXMtZDQTgl95BEshr7WIzAR7li12/view?usp=sharing

I will meanwhile try to build the master…

@ptrblck so I built the master and ran it again with an all-ones input, like so:

  module_af = torch::jit::load(af_model_path);

    std::vector<torch::jit::IValue> inputs;
    at::Tensor input = torch::ones({1, 3, 112, 112});
    inputs.push_back(input);
    at::Tensor embeds0 = module_af.forward(inputs).toTensor();

Run 1, I get (just the head of embeds0):

-0.132933
0.00335889
-0.283859
-0.127157
-0.0562173
-0.0433259
-0.0634281
-0.0114401
-0.0847342
-0.0239286

Run 2:

-0.194605
0.00335889
-0.283859
-0.127157
-0.0562173
-0.0433259
-0.0634281
-0.0114401
-0.0847342
-0.0239286

Run 3:

-0.194605
0.00335889
-0.283859
-0.127157
-0.0320272
-0.029183
-0.0418263
-0.0157832
-0.0434114
-0.015427

I find the outputs truly bizarre. It even seems like I must be doing something very stupid, but I just cannot see how the model can output different numbers with ones as input. When the model was jit traced, it was in eval mode.
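
To quantify how much the output actually moves between calls, a small sketch I can run (nothing beyond the calls already used above) loops over the same constant input in one process and tracks the worst element-wise deviation from the first result:

    #include <algorithm>

    at::Tensor input = torch::ones({1, 3, 112, 112});
    at::Tensor ref = module_af.forward({input}).toTensor();
    double max_diff = 0.0;
    for (int i = 0; i < 100; ++i) {
        at::Tensor out = module_af.forward({input}).toTensor();
        // Largest absolute difference against the first run's embedding.
        max_diff = std::max(max_diff, (out - ref).abs().max().item<double>());
    }
    std::cout << "max deviation over 100 runs: " << max_diff << "\n";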

@ptrblck I was wondering (and am VERY curious) whether you have managed to reproduce the issue? Thank you for all your help with this.