Getting different results on each run with libtorch 1.6

I traced a model and I am using it within libtorch, but for some bizarre reason the results are different on each run :frowning: I am totally baffled by this.

So, I load the model like so:

  torch::jit::script::Module module_af;
  module_af = torch::jit::load(af_model_path);
  module_af.eval();
  // std::cout << "Model Load ok\n";
  filelog.get()->info("Model Load ok");

and I run the inference like so:

    Mat img4encodingRGB = imread(allign_filename, cv::COLOR_BGR2RGB);
    auto img2encode = torch::from_blob(img4encodingRGB.data, {img4encodingRGB.rows, img4encodingRGB.cols, img4encodingRGB.channels()}, at::kByte);
  
    img2encode = img2encode.to(at::kFloat).div(255).unsqueeze(0);
    img2encode = img2encode.permute({ 0, 3, 1, 2 });
    img2encode.sub_(0.5).div_(0.5);
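
As an aside (and probably unrelated to the nondeterminism): cv::imread's second argument is an ImreadModes flag, not a colour-conversion code, so passing cv::COLOR_BGR2RGB there does not actually swap the channel order - the Mat is still BGR. If RGB order is intended, a minimal sketch of the usual explicit conversion (standard OpenCV API) would be:

    // Read the image (OpenCV loads it in BGR order) and convert to RGB
    // explicitly; cv::COLOR_BGR2RGB is only meaningful as a cvtColor code.
    Mat img4encodingBGR = imread(allign_filename, cv::IMREAD_COLOR);
    Mat img4encodingRGB;
    cv::cvtColor(img4encodingBGR, img4encodingRGB, cv::COLOR_BGR2RGB);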

I run the forward like so:

    std::vector<torch::jit::IValue> arcface_inputs;
    arcface_inputs.push_back(img2encode);
    at::Tensor embeds0 = module_af.forward(arcface_inputs).toTensor();

    std::cout << embeds0; // Gives different output on each run.

I am really baffled by this. The problem seems to be even worse - on two machines they produce identical results on consecutive runs, but on two other machines they don't. All packages are EXACTLY the same - libtorch 1.6 and the above code compiled with cmake.

It kind of reminds me of undefined behaviour, but I am totally lost, because on two machines (a server and a VM) they seem to produce identical results - but they don't on the other two.

I have triple-checked everything to see if I am doing something stupid, but it does not seem like it - hence this post.

Hope someone can point me to clues as to what I could be doing wrong :sob:

Example output, run 1:

Columns 1 to 10
-0.1005 -0.1768 -0.2082  0.1240  0.1185  0.3801  0.1378  0.1269 -0.3572 -1.1453

Run 2:

Columns 1 to 10
-0.1861 -0.3326 -0.3739  0.2302  0.1730  0.5391  0.1965  0.1972 -0.5481 -1.7317
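
One way I could try to narrow this down (a minimal sketch, reusing module_af and img2encode from above) is to call forward twice on the identical tensor within a single process and compare the results; if those already differ, the variation is not coming from anything that happens at start-up:

    // Two forward passes on the same input in the same process.
    at::Tensor a = module_af.forward({img2encode}).toTensor();
    at::Tensor b = module_af.forward({img2encode}).toTensor();
    std::cout << "identical within one process: "
              << torch::allclose(a, b) << "\n";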

Could you check if a static input (e.g. torch::ones) would also yield different results, please?


@ptrblck thank you for this suggestion (not sure why I did not think of it).

So, I ran with ones like so:

    std::vector<torch::jit::IValue> inputs;
    at::Tensor input = torch::ones({1, 3, 112, 112});
    inputs.push_back(input);
    at::Tensor embeds0 = module_af.forward(inputs).toTensor();
    // at::Tensor embeds0 = module_af.forward({img2encode}).toTensor();

    std::cout << embeds0;

The result is that on the (consumer-grade) laptops the results vary with every run - BUT when I run the EXACT same code on a server VM, I get repeatable results. This is so baffling. I am very much inclined to think this has something to do with the hardware.
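
One thing I plan to rule out (just my own guess, a minimal sketch) is intra-op threading: with multiple threads the summation order of reductions can vary between runs, which could give different floating-point results on some CPUs but not others. Pinning ATen to a single thread before the forward pass should test that:

    #include <ATen/Parallel.h>

    // Force single-threaded execution so the reduction order is fixed.
    at::set_num_threads(1);

    std::vector<torch::jit::IValue> inputs;
    inputs.push_back(torch::ones({1, 3, 112, 112}));
    at::Tensor embeds0 = module_af.forward(inputs).toTensor();
    std::cout << embeds0;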

I forgot to mention, I had compiled libtorch 1.6 from source for CPU (I had edited the makefile).
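
Since the machines behave differently, I will also dump the build/runtime configuration on each of them to confirm the builds really are identical - a small sketch, assuming at::show_config() is available in the C++ API (it is what torch.__config__.show() prints in Python):

    #include <ATen/Version.h>

    // Print compiler, BLAS / MKL-DNN backends, parallel backend, etc.
    std::cout << at::show_config() << std::endl;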

Thanks for the update!
Would it be possible to rebuild PyTorch from the current master branch in case you are hitting a known and already fixed issue?
If that also yields large numerical differences, you might indeed be hitting a (hardware-specific) bug, and I’m happy to help debug it.


@ptrblck Thank you VERY much for your suggestions. I am now rebuilding the master branch (11.0) - I will let you know how this goes. Another update: I traced another model (thinking it could be the model itself)… but this also behaves the same way on my consumer-grade hardware while returning consistent results on the server. How bizarre. Even more bizarre is that I have another model loaded within the same program, and that one always returns consistent results everywhere.

BTW: would it help if I sent you the model? Just wondering.

best

Yes, the model definition would be helpful to have so that I can export it and try to reproduce the issue.


@ptrblck thank you for helping me out on this (still totally baffled by it all and not able to make any progress :sob: )

The jit traced model is here:

https://drive.google.com/file/d/1AbUPU9awdcljYZigXJbYjoA8Hh2NiMfx/view?usp=sharing

The model accepts a (1, 3, 112, 112) input and produces a (1, 512) output.

The original model, from which the trace was produced using jit.trace, is here:

https://drive.google.com/file/d/1iz9rIXMtZDQTgl95BEshr7WIzAR7li12/view?usp=sharing

I will meanwhile try to build the master…

@ptrblck so I built the master and ran it again with an all-ones input, like so:

  module_af = torch::jit::load(af_model_path);

    std::vector<torch::jit::IValue> inputs;
    at::Tensor input = torch::ones({1, 3, 112, 112});
    inputs.push_back(input);
    at::Tensor embeds0 = module_af.forward(inputs).toTensor();

Run 1, I get (just the head of embeds0):

-0.132933
0.00335889
-0.283859
-0.127157
-0.0562173
-0.0433259
-0.0634281
-0.0114401
-0.0847342
-0.0239286

Run 2:

-0.194605
0.00335889
-0.283859
-0.127157
-0.0562173
-0.0433259
-0.0634281
-0.0114401
-0.0847342
-0.0239286

Run 3:

-0.194605
0.00335889
-0.283859
-0.127157
-0.0320272
-0.029183
-0.0418263
-0.0157832
-0.0434114
-0.015427

I find the outputs truly bizarre. It even seems like I must be doing something very stupid, but I just cannot see how the model can output different numbers with ones as input. When the model was jit traced, it was in eval mode.
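
To quantify how much the output actually moves between calls, a small sketch I can run (nothing beyond the calls already used above) loops over the same constant input in one process and tracks the worst element-wise deviation from the first result:

    #include <algorithm>

    at::Tensor input = torch::ones({1, 3, 112, 112});
    at::Tensor ref = module_af.forward({input}).toTensor();
    double max_diff = 0.0;
    for (int i = 0; i < 100; ++i) {
        at::Tensor out = module_af.forward({input}).toTensor();
        // Largest absolute difference against the first run's embedding.
        max_diff = std::max(max_diff, (out - ref).abs().max().item<double>());
    }
    std::cout << "max deviation over 100 runs: " << max_diff << "\n";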

@ptrblck I was wondering (and am VERY curious) whether you have managed to reproduce the issue? Thank you for all your help with this.