Why does the same model give different results on CUDA and CPU?

I tested a model and got different results on CUDA and on the CPU.
On the CPU, the model produces correct results.
On CUDA, I have to move the output to the CPU before I can process it, and the result is incorrect.

		torch::Tensor score;
		if (isCUDA) {
			torch::Tensor new_out_tensor = out_tensor.to(torch::kCPU).detach();
			score = new_out_tensor.squeeze(0); // [288, 384, 2]
		} else {
			score = out_tensor.squeeze(0).detach(); // [288, 384, 2]
		}

If there is no conversion to the CPU, a memory exception will occur.

Am I doing something wrong here?
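For reference, the copy-back itself is expected: host-side APIs can only read CPU memory, so a CUDA tensor has to be moved before its values are accessed. A minimal Python sketch of the same pattern (the C++ `.to(torch::kCPU)` corresponds to `.cpu()` here):

```python
import torch

x = torch.randn(2, 2)
if torch.cuda.is_available():
    x = x.cuda()  # move to GPU when one is available

# Reading raw values (e.g. via .numpy()) requires a CPU tensor,
# so copy back first -- the copy changes the device, not the values.
arr = x.cpu().numpy()
print(arr.shape)  # (2, 2)
```

So moving the output to the CPU cannot by itself change the numbers; the divergence must come from earlier in the pipeline.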

CPU result:

CUDA result:

Why do you detach() when you compute your scores? That would prevent gradients from flowing if you use this during training.


Actually, I don’t know why I use it; the code just came from the Internet.
Since I am not using it for training, only for inference, detach() shouldn’t cause a problem.
With detach() in place, the CPU path executes normally without issues.


If you just use this for evaluation, you should use the NoGradGuard (not sure about the exact name, but something along those lines) to completely disable autograd (that will make your code faster and use less memory)!
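The guard referred to here is `torch::NoGradGuard` in libtorch, the counterpart of `torch.no_grad()` in Python. A minimal sketch with a stand-in linear model:

```python
import torch

model = torch.nn.Linear(4, 2)  # stand-in model for illustration
model.eval()

x = torch.randn(1, 4)

# Disabling autograd for inference skips building the graph,
# which saves both memory and time.
with torch.no_grad():
    y = model(x)

print(y.requires_grad)  # False
```

In C++ the same effect is achieved by declaring `torch::NoGradGuard no_grad;` at the top of the inference scope.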

I guess there is something wrong in the way you send the model to the GPU. Are you sure that the weights / inputs to the network are properly sent to the GPU?
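One quick sanity check along these lines (a sketch; the model here is a placeholder for the real network) is to confirm that every parameter and the input all sit on the same device:

```python
import torch

model = torch.nn.Linear(4, 2)  # placeholder for the real network
x = torch.randn(1, 4)

if torch.cuda.is_available():
    model = model.cuda()
    x = x.cuda()

# Collect every device in use; a correct setup yields exactly one.
devices = {p.device for p in model.parameters()} | {x.device}
assert len(devices) == 1, f"mixed devices: {devices}"
print("all on", next(iter(devices)))
```

If this set ever contains more than one device, part of the model or input never made it to the GPU.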


This is the model-loading code:

void CRAFT::loadModel(string& model_path, bool& isRuntimeIn) {
	isRuntime = isRuntimeIn;
	if (isCUDA) {
		module = torch::jit::load(model_path, torch::kCUDA);
	} else {
		module = torch::jit::load(model_path);
	}
	assert(module != nullptr);
	std::cout << "ok\n";
}

As you can see, I switch between CUDA and CPU in the same program, so the method and process are exactly the same. I don’t know why the results differ.


I added NoGradGuard, and the CUDA result is still different.

torch::NoGradGuard no_grad_guard;

My forward code:

	if (isCUDA) {
		output = module.forward({ tensor_image.to(torch::kCUDA) });
	} else {
		output = module.forward({ tensor_image });
	}

You should be able to print the values of the different tensors to stdout (both CPU and GPU). Can you print the inputs, weights, and outputs to see where the difference appears?


I will be back next Thursday.
Then we can continue the discussion…


I tried writing the output to a file and stored it on my Google Drive.
Link:
The GPU and CPU really do get different results…


As I said above, you should inspect where the discrepancy first appears. Is it when you load the weights? In the inputs? Or when you apply the forward pass?


OK, I will try it.
But I want to put my model and code on GitHub. If you can, please try it out (please pay attention to the operating instructions: choose 1 for CUDA, 2 for CPU).

Once you run it, you will see that it is very strange.

Because the model is not small, please note that if you use the GPU, the GPU RAM should not be too small, otherwise the program will crash.


Hello, a classifier trained with ResNet50 works well in Python, but the same test data gives poor classification accuracy when called from C++ (libtorch). I don’t know why. Have you encountered this? Thank you!

Haha, not yet.
Maybe later…
I hope libtorch gets better…

Did you make sure to use the same preprocessing, etc.?
I would recommend checking the inputs first and getting the same values in the Python API and libtorch, then trying to narrow down the discrepancy.
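One way to check that first point, sketched with stand-in tensors (`py_input` and `cpp_input` are illustrative names for the preprocessed input saved from each pipeline, not anything from the thread):

```python
import torch

# Stand-ins: in practice, save the preprocessed input from each
# pipeline to disk and load both copies here for comparison.
py_input = torch.arange(12, dtype=torch.float32).reshape(1, 3, 2, 2) / 255.0
cpp_input = py_input.clone()

# If the inputs already differ, the bug is in preprocessing,
# not in the model or the device.
same = torch.allclose(py_input, cpp_input, rtol=1e-6, atol=1e-6)
print("inputs match:", same)
```

Only once the inputs agree to a tight tolerance is it worth comparing layer outputs further downstream.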

I have uploaded my code and model here:

It is the same process in one program; I just switch between GPU and CPU.

You are not the first one to have such a problem. I have a similar one here, and there is another unanswered one here.

I suggest you try to locate the source of the divergence yourself first; it makes it easier to help you. I had Python code, not C++, but I’ll share it here so you get the idea of how to locate the problem.

  1. Save off the intermediate variables during both CPU and GPU inference:

     torch.save(variable, "/path/to/varfile")

  2. Then afterwards, load both for analysis:

     cpuvar = torch.load("/path/to/varfile_cpu", map_location="cpu")
     gpuvar = torch.load("/path/to/varfile_gpu", map_location="cpu")

  3. Compare:

     close = torch.isclose(cpuvar, gpuvar, rtol=1e-04, atol=1e-04)
     print("SIMILAR", close[close==True].shape)
     print("FAR", close[close==False].shape)

The ideal case is where CPU and GPU have similar results for the same input. Compare all variables until you find the divergence.
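A self-contained version of the comparison step above, with stand-in tensors in place of the `torch.load`-ed variables from the two runs:

```python
import torch

# Stand-ins for the saved CPU and GPU variables; the "GPU" copy
# gets tiny noise to mimic benign floating-point drift.
cpuvar = torch.randn(3, 3)
gpuvar = cpuvar + 1e-6 * torch.randn(3, 3)

close = torch.isclose(cpuvar, gpuvar, rtol=1e-4, atol=1e-4)
print("SIMILAR", int(close.sum()))   # 9  (all elements within tolerance)
print("FAR", int((~close).sum()))    # 0
```

Drift of around 1e-6 between devices is normal floating-point behavior; a real divergence shows up as a large "FAR" count at some layer, and that layer is where to look.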

I will try it.
But it’s very strange: Python PyTorch on the GPU (CUDA 10) runs normally.

Thank you, I will try it first.

Why does CRAFT behave so strangely? I didn’t encounter this problem with other models. Have you figured it out now?