Clearly different results on different machines

I implemented YOLOv3 using libtorch and trained it with a Tesla V100 on the server several times until the recall was high enough. However, when I converted the model to a CPU version, saved it, and ran the same code on my laptop, the recall was extremely low. I therefore set up a comparison experiment: running exactly the same model, code, and training set on the server as on the laptop, the recall reported on the server was quite different from the recall on my laptop.

The first result is from the server, running Ubuntu in CPU mode with libtorch 1.5.0; the second is from my laptop, running Windows 10, also in CPU mode with libtorch 1.5.0. (Screenshots of both results were attached here.)

You can see the apparent difference in metrics.

This result is completely incomprehensible to me, and I have controlled the variables as much as possible; the only remaining differences are the two operating systems and the hardware. I don’t know whether this is a normal phenomenon, but if there is such a huge difference between systems, how does anyone transplant pretrained weights?

There might be several different issues, so let’s split them up.

  1. The pretrained model should give you approximately the same outputs up to the limited floating point precision. Especially when using different hardware, you might not be able to avoid the absolute errors introduced by the limited floating point precision. If you load the state_dict properly, set model.eval(), and still get largely different results, we would need to see the code to debug further.

  2. Training on different platforms might also yield slightly different results for the aforementioned reasons. However, it would be interesting to see how reliable the difference is, i.e. out of e.g. 10 runs on each platform with different seeds, how often does one model converge while the other diverges?
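To see why identical code can legitimately print slightly different numbers on different machines, here is a small stand-alone illustration (plain Python floats, not libtorch) of how the order of floating point operations changes the result, and why cross-machine comparisons should use a tolerance rather than exact equality:

```python
import math

# Floating point addition is not associative, so different kernels or
# instruction orderings on different hardware can produce slightly
# different numbers from the same inputs.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)        # False
print(abs(a - b))    # tiny, on the order of 1e-16

# Summation order matters too: a naive left-to-right sum of ten 0.1s
# misses 1.0, while a compensated sum recovers it exactly.
vals = [0.1] * 10
print(sum(vals) == 1.0)        # False
print(math.fsum(vals) == 1.0)  # True

# The right check across machines is therefore a tolerance:
print(math.isclose(a, b, abs_tol=1e-9))  # True
```

Differences at this scale (1e-16 relative, or around 1e-6 for float32) are expected between platforms; differences in whole detection boxes, as reported below, are not.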

Thank you for your prompt reply.

I can certainly understand floating point differences between platforms, but the discrepancy I ran into seems much larger than expected. I thought about it for a while, but I still don’t know how to solve the problem. For example, I will eventually deploy to a laptop, so would I also have to train on the laptop? In my experience, that doesn’t make much sense.

On the other hand, you mentioned the need to see the code to debug further. How exactly should I do that?

No, as mentioned above, the pretrained model should give you the same outputs up to the floating point precision.

You could write an evaluation script, which loads the model, uses a constant tensor, applies the forward pass, and prints the output.
Post this code here by wrapping it into three backticks ``` and also post the outputs from the Linux server and your laptop.
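As a sketch of what the comparison of those printed outputs could look like (the helper name and the tolerance are my own assumptions, not part of the thread), the rows printed on both machines can be checked element-wise against an absolute tolerance instead of exact equality:

```python
# Hypothetical helper: decide whether two printed output rows agree up to
# floating point noise. The 1e-4 tolerance is an assumption; genuine
# float32 divergence between platforms is usually far smaller than that.
def outputs_match(row_a, row_b, atol=1e-4):
    if len(row_a) != len(row_b):
        return False
    return all(abs(x - y) <= atol for x, y in zip(row_a, row_b))

same  = [0.0, 66.9942, 147.0321, 173.0608]
noisy = [0.0, 66.9942, 147.03211, 173.0608]  # differs by ~1e-5
other = [0.0, 66.4557, 148.0490, 172.6351]   # differs by ~0.5-1.0

print(outputs_match(same, noisy))  # True: within float noise
print(outputs_match(same, other))  # False: a real discrepancy
```

A mismatch that survives this kind of tolerance check points to a real bug (preprocessing, loading, thresholds) rather than platform floating point behavior.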

This is the code snippet I used to test:

```cpp
torch::DeviceType device_type = torch::kCPU;
torch::Device device(device_type);

Yolo net(device);
auto train_set = VOCDataset(device).map(Collate{}); //.map(torch::data::transforms::Stack<>());
auto train_size = train_set.size().value();
auto train_loader = torch::data::make_data_loader<torch::data::samplers::SequentialSampler>(
    std::move(train_set) /* remaining arguments truncated in the post */);

torch::optim::Adam optimizer(net->parameters() /* options truncated in the post */);
torch::load(net, train_opts.model_path);

cv::Mat origin_image = cv::imread("./imgs/dog.jpg"), resized_image;
cv::cvtColor(origin_image, resized_image, cv::COLOR_BGR2RGB);
resized_image = resizeKeepAspectRatio(origin_image, cv::Size(416, 416), { 128, 128, 128 });

cv::Mat img_float;
resized_image.convertTo(img_float, CV_32F, 1.0 / 255);

// Wrap the float HWC image in a tensor, then convert to NCHW.
auto img_tensor = torch::from_blob(img_float.data, { 1, 416, 416, 3 }, torch::kFloat32).to(device);
img_tensor = img_tensor.permute({ 0, 3, 1, 2 }).contiguous();
auto output = net->forward(img_tensor);

std::cout << non_max_suppression(output, 20, 0.1, 0.9);
```

Here is what I got on the server:

   0.0000   66.9942  147.0321  173.0608  340.8369    0.1800    0.9806    0.0000
[ CPUFloatType{1,8} ]

while here is what I got on my laptop:

   0.0000  222.2365  106.0619  242.9099  132.0313    0.1007    0.9915    0.0000
   0.0000  136.0336  192.0411  292.1751  281.3036    0.1112    0.9894    0.0000
   0.0000   66.4557  148.0490  172.6351  341.0895    0.2453    0.9815    0.0000
[ CPUFloatType{3,8} ]

Assuming you are running this code snippet on both machines, I would recommend checking the intermediate tensors, i.e. the per-layer output activations, using hooks.
If I’m not mistaken, forward hooks are not available in libtorch yet, so you could use the Python frontend.
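For reference, a forward hook in the Python frontend could look roughly like this; the tiny model below is only a placeholder standing in for the YOLO network from the thread, and the printed statistics are just one way to compare layers across machines:

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the YOLO network from the thread.
torch.manual_seed(0)
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Conv2d(8, 4, 3))
model.eval()

# Record simple statistics of every layer's output so that the two
# machines can be compared layer by layer.
stats = {}

def make_hook(name):
    def hook(module, inputs, output):
        stats[name] = (output.mean().item(), output.abs().max().item())
    return hook

for name, module in model.named_modules():
    if name:  # skip the root Sequential container itself
        module.register_forward_hook(make_hook(name))

with torch.no_grad():
    model(torch.ones(1, 3, 16, 16))  # constant input, as suggested above

for name, (mean, max_abs) in stats.items():
    print(f"{name}: mean={mean:.6f}  max|x|={max_abs:.6f}")
```

Running the same script (with the real checkpoint and a constant input) on both machines and diffing the printed per-layer statistics shows at which layer the outputs start to diverge beyond floating point noise.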