Why does the same model give different results on CUDA and on the CPU?

Haha!
Actually, I'm not sure why I use it; it just came from code I found on the Internet.
Since I'm not using it for training, only for prediction, the implementation itself should not be the problem.
The part with detach() runs normally on the CPU without any problems.


If you just use this for evaluation, you should use NoGradGuard (not sure about the exact name, but something along those lines) to completely disable autograd (that will make your code faster and use less memory)!

I guess there is something wrong in the way you send the model to the GPU. Are you sure that the weights / inputs to the network are properly sent to the GPU?
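
For example, a minimal sketch of an inference-only setup in libtorch (the model path and input shape are placeholders, and it assumes the scripted model returns a single tensor):

#include <torch/script.h>
#include <iostream>

int main() {
    // Disable autograd for inference: faster and uses less memory.
    torch::NoGradGuard no_grad;

    // Load the TorchScript model directly onto the GPU and switch to eval mode.
    torch::jit::script::Module module = torch::jit::load("model.pt", torch::kCUDA);
    module.eval();

    // The input must live on the same device as the weights.
    torch::Tensor input = torch::rand({1, 3, 224, 224}).to(torch::kCUDA);
    torch::Tensor output = module.forward({input}).toTensor();

    std::cout << output.sizes() << std::endl;
    return 0;
}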


This is the model-loading code:

void CRAFT::loadModel(string& model_path, bool& isRuntimeIn) {
	isRuntime = isRuntimeIn;
	//at::init_num_threads();
	if (isCUDA) {
		// Load the TorchScript model directly onto the GPU.
		module = torch::jit::load(model_path, torch::kCUDA);
	}
	else {
		// Load on the CPU (the default device).
		module = torch::jit::load(model_path);
	}

	assert(module != nullptr);
	std::cout << "ok\n";
}

As you can see, I switch between CUDA and CPU in the same program, so the whole method and process are identical. I don't understand why the results are different.


I added NoGradGuard, but the CUDA result is still different.

torch::NoGradGuard no_grad_guard;

My forward code:

	if (isCUDA) {
		// Move the input tensor to the GPU so it matches the device of the loaded model.
		output = module.forward({ tensor_image.to(torch::kCUDA) });
	}
	else {
		output = module.forward({ tensor_image });
	}

You should be able to print the values of the different tensors to stdout (both CPU and GPU). Can you print the inputs, weights, and outputs to see where the difference appears?
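
For example, a hedged sketch of a helper you could add on the C++ side (printStats is my own name, not from this thread; adapt the toTensor() unpacking if your model returns a tuple):

#include <torch/script.h>
#include <iostream>
#include <string>

// Print a few summary statistics of a tensor so the CPU and GPU runs
// can be compared side by side in the console.
void printStats(const torch::Tensor& t, const std::string& name) {
    auto c = t.detach().to(torch::kCPU).to(torch::kFloat);
    std::cout << name
              << " min="  << c.min().item<float>()
              << " max="  << c.max().item<float>()
              << " mean=" << c.mean().item<float>()
              << " sum="  << c.sum().item<float>() << std::endl;
}

Calling something like printStats(tensor_image, "input") and printStats(output.toTensor(), "output") in both the CPU run and the GPU run should show where the numbers start to diverge.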


Thank you.
I will be back next Thursday; then we can continue the discussion…


I tried writing the outputs to files and stored them on my Google Drive:
https://drive.google.com/drive/folders/1vK3ixC792Kkv-OOD0ckX1yF_pzoDprJ1?usp=sharing
The GPU and CPU really do give different results…


As I said above, you should inspect where the discrepancy appears. Is it when you load the weights? On the inputs? Or when you apply the forward pass?
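
For instance, a hedged sketch of one way to check the weights (printWeightChecksums is a made-up helper; run it once after the CPU load and once after the CUDA load and compare the printed sums):

#include <torch/script.h>
#include <iostream>

// Print a simple checksum for every parameter of a loaded TorchScript module.
// Identical weights should print identical sums on both devices,
// up to floating-point noise.
void printWeightChecksums(torch::jit::script::Module& module) {
    for (const auto& p : module.named_parameters()) {
        auto w = p.value.to(torch::kCPU).to(torch::kFloat);
        std::cout << p.name << " sum=" << w.sum().item<float>() << std::endl;
    }
}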


Ok, I will try it out.
I also want to put my model and code on GitHub; once it is up, please try it out if you can (pay attention to the run instructions: choose 1 for CUDA, 2 for CPU).

Once you run it you will see what I mean; it seems very strange.

Because the model is not small, please note that if you use the GPU it needs enough GPU RAM, otherwise the program will crash.


Hello, a classifier trained with ResNet50 works well in Python, but the same test data gets poor classification accuracy when called from C++ (libtorch). I don't know why. Have you run into this? Thank you!

Haha, not yet.
Maybe later…
I hope libtorch gets better…

Did you make sure to use the same preprocessing etc.?
I would recommend checking the inputs first, making sure you get the same values in the Python API and in libtorch, and then trying to narrow down the discrepancy.
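
For example, a minimal sketch of printing the first few input values on the libtorch side, so they can be compared against the same print in the Python pipeline (the random tensor is only a placeholder for your real preprocessed input):

#include <torch/torch.h>
#include <iostream>

int main() {
    // Placeholder: replace with the actual preprocessed input tensor.
    torch::Tensor tensor_image = torch::rand({1, 3, 224, 224});

    // Print the shape and the first few values; the Python side should print
    // exactly the same numbers if the preprocessing matches.
    std::cout << "shape: " << tensor_image.sizes() << std::endl;
    std::cout << tensor_image.flatten().slice(/*dim=*/0, /*start=*/0, /*end=*/10) << std::endl;
    return 0;
}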

I have uploaded my code and model here:

It is the same process in one program; it just switches between GPU and CPU.

You are not the first one to have such a problem. I have a similar one here, and there is another unanswered one here.

I suggest you try to locate the source of the divergence yourself first; that makes it easier to help you. My code was Python, not C++, but I'll share it here so you get the idea of how to locate the problem.

  1. Save off the intermediate variables from the CPU and GPU inference runs:
     torch.save(variable, "/path/to/varfile")
  2. Then afterwards load both for analysis:
     cpuvar = torch.load("/path/to/varfile_cpu", map_location="cpu")
     gpuvar = torch.load("/path/to/varfile_gpu", map_location="cpu")
  3. Compare:
     close = torch.isclose(cpuvar, gpuvar, rtol=1e-04, atol=1e-04)
     print("SIMILAR", close[close==True].shape)
     print("FAR", close[close==False].shape)

In the ideal case, CPU and GPU will have similar results for the same input. Compare all the variables until you find the divergence.

Thanks, I will try it.
But it's very strange: with PyTorch in Python on the GPU (CUDA 10) it runs normally.

Thank you, I will try it first.

Why does CRAFT behave so strangely? I didn't run into this problem with other models. Have you figured it out yet?

Haha…
I have given up on libtorch.
For speed I have now switched to TensorRT,
but I hit another problem: converting PyTorch to ONNX.
Life has so many mountains to climb…

TensorRT is another thing entirely…

It's amazing that we hit exactly the same problem; our code looks the same too.

I'm going to dig into the problem for a while before I give up.

So how is it going with your CRAFT model using TensorRT?

@NickKao Hi, NickKao. I have solved the problem.

The problem comes from moving torch::Tensor data to cv::Mat with the following line:

std::memcpy((void *) textmap.data, score_text.data_ptr(), torch::elementSize(torch::kFloat) * score_text.numel());

The reason is believed to be related to tensor contiguity in PyTorch. libtorch with CUDA may have some issue when moving data from CUDA to the CPU: the model outputs checked on the CPU and CUDA devices are the same, but if you then move the tensor data from the CPU tensor into a cv::Mat with a raw memory copy, the data may NOT be CONTIGUOUS.

As for the CRAFT model, just remove the permute operation when you trace the model.

As is known, view(), permute(), narrow(), expand() and transpose() can all produce non-contiguous tensors.

Hope this helps.

Anyway, I solved the problem by doing this; I am NOT SURE whether it is the real underlying reason.
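
For reference, a minimal sketch of a copy pattern that avoids the contiguity issue (tensorToMat is my own helper name, and it assumes score_text is a 2-D float score map; it is not necessarily the exact fix described above):

#include <torch/torch.h>
#include <opencv2/opencv.hpp>
#include <cstring>

// Copy a 2-D float tensor into a cv::Mat. The key point is calling
// .to(torch::kCPU).contiguous() first, so the raw memcpy below reads a
// densely packed buffer even if the tensor came from permute() or view().
cv::Mat tensorToMat(const torch::Tensor& score_text) {
    torch::Tensor t = score_text.detach().to(torch::kCPU).to(torch::kFloat).contiguous();
    cv::Mat textmap(static_cast<int>(t.size(0)), static_cast<int>(t.size(1)), CV_32FC1);
    std::memcpy(textmap.data, t.data_ptr<float>(), sizeof(float) * t.numel());
    return textmap;
}

Alternatively, removing the permute() from the traced model, as suggested above, avoids the non-contiguous layout in the first place.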